Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in fields such as genomics, ecology, and psychometrics. In a high-dimensional setting, where the number of variables may exceed the number of observations by several orders of magnitude, the estimation of GGMs is a difficult and unstable optimization problem. Clustering of variables or variable selection is therefore often performed prior to GGM estimation. We propose a new method that simultaneously infers a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. The method is based on solving a convex optimization problem combining a graphical Lasso penalty with a fused-type Lasso penalty. Results on real and synthetic data are presented.
Probabilistic graphical models (Lauritzen 1996; Koller and Friedman 2009) are widely used in high-dimensional data analysis to synthesize the interaction between variables. In many applications, such as genomics or image analysis, graphical models reduce the number of parameters by selecting the most relevant interactions between variables. Undirected Gaussian Graphical Models (GGMs) are a class of graphical models used in Gaussian settings. In the context of high-dimensional statistics, graphical models are generally assumed sparse, meaning that a small number of variables interact, compared to the total number of possible interactions. This assumption has been shown to provide both statistical and computational advantages by simplifying the structure of dependence between variables (Dempster 1972) and allowing efficient algorithms (Meinshausen and Bühlmann 2006). See, for instance, (Fan, Liao, and Liu 2016) for a review about sparse graphical models inference.
In GGMs, it is well known (see, e.g., Lauritzen (1996)) that inferring the graphical model, or equivalently the Conditional Independence Graph (CIG), boils down to inferring the support of the precision matrix \mathbf{\Omega} (the inverse of the variance-covariance matrix). To do so, several \ell_1 penalized methods have been proposed in the literature to learn the CIG of GGMs. For instance, the neighborhood selection (Meinshausen and Bühlmann 2006) (MB) based on a nodewise regression approach via the least absolute shrinkage and selection operator (Lasso, Tibshirani (1996)) is a popular method. Each variable is regressed on the others, taking advantage of the link between the resulting regression coefficients and partial correlations. More precisely, for all 1 \le i \le p, the following problem is solved:

\hat{\boldsymbol{\beta}}^i = \operatorname*{arg\,min}_{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}} \frac{1}{2n} \left\lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right\rVert_2^2 + \lambda \left\lVert \boldsymbol{\beta}^i \right\rVert_1. \tag{1}
In Equation 1, \lambda is a non-negative regularization parameter and \mathbf{X}^{\setminus i} denotes the matrix \mathbf{X} deprived of column i. The MB method defined by the estimation problem in Equation 1 has generated a long line of work in the field of nodewise regression methods. For instance, Rocha, Zhao, and Yu (2008), Ambroise, Chiquet, and Matias (2009) showed that nodewise regression could be seen as a pseudo-likelihood approximation, and Peng et al. (2009) extended the MB method to estimate sparse partial correlations using a single regression problem. Other inference methods similar to nodewise regression include a method based on the Dantzig selector (Yuan 2010) and the introduction of the Clime estimator (Cai, Liu, and Luo 2011).
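As an illustration, the MB nodewise regressions can be sketched with scikit-learn's `Lasso` (a choice of this sketch; any Lasso solver would do), combining the p estimated supports into a single graph with an AND or OR rule:

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam, rule="and"):
    """Meinshausen-Buhlmann sketch: regress each variable on the others with
    a Lasso and read the graph off the supports of the p regressions."""
    n, p = X.shape
    B = np.zeros((p, p))  # B[i, j] = coefficient of X^j in the regression of X^i
    for i in range(p):
        others = [j for j in range(p) if j != i]
        fit = Lasso(alpha=lam, fit_intercept=False).fit(X[:, others], X[:, i])
        B[i, others] = fit.coef_
    # Combine the p supports into one symmetric adjacency matrix
    if rule == "and":
        A = (B != 0) & (B != 0).T
    else:  # "or" rule
        A = (B != 0) | (B != 0).T
    np.fill_diagonal(A, False)
    return A.astype(int)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
A = neighborhood_selection(X, lam=0.3)
```

The AND/OR symmetrization rules are the two options discussed by Meinshausen and Bühlmann (2006); `neighborhood_selection` is a name we introduce for this sketch.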
Such inference methods are widely used and enjoy many favorable theoretical and empirical properties, including robustness to high-dimensional problems. However, some limitations have been observed, particularly in the presence of strongly correlated variables. These limitations are caused by known impairments of Lasso-type regularization in this context (Bühlmann et al. 2012; Park, Hastie, and Tibshirani 2006). To overcome this, in addition to sparsity, several previous works attempt to estimate the CIG by integrating clustering structures among variables, for the sake of both statistical soundness and interpretability. A non-exhaustive list of works that integrate a clustering structure to speed up or improve the estimation procedure includes (Honorio et al. 2009; Ambroise, Chiquet, and Matias 2009; Mazumder and Hastie 2012; Tan, Witten, and Shojaie 2013; Yao and Allen 2019; Devijver and Gallopin 2018).
The above methods exploit the group structure to simplify the graph inference problem and infer the CIG between single variables. Another question that has received less attention is the inference of the CIG between groups of variables, i.e., between the meta-variables representative of the group structure. A recent work introducing inference of graphical models on multiple grouping levels is (Cheng, Shan, and Kim 2017). They proposed inferring the CIG of gene data on two levels corresponding to genes and pathways, respectively. Note that pathways are groups of functionally related genes known in advance. The inference is achieved by optimizing a penalized maximum likelihood that estimates a sparse network at both gene and group levels. Our work follows this line of research. We introduce a penalty term allowing parsimonious networks to be built at different hierarchical clustering levels. The main difference with the procedure of (Cheng, Shan, and Kim 2017) is that we do not require prior knowledge of the group structure, which makes the problem significantly more complex. In addition, our method has the advantage of proposing CIGs at more than two levels of granularity.
We introduce the Multiscale Graphical Lasso (MGLasso), a novel method to simultaneously estimate a hierarchical clustering structure and graphical models depicting the conditional independence structure between clusters of variables at each level of the hierarchy. The procedure is based on a convex optimization problem with a hybrid penalty term combining a graphical Lasso and a group-fused Lasso penalty. In the spirit of convex hierarchical clustering, introduced by (Hocking et al. 2011; Lindsten, Ohlsson, and Ljung 2011), the hierarchy is obtained by spanning the entire regularization path. At each level of the hierarchy, variables in the same cluster are represented by a meta-variable; the new CIG is estimated between these meta-variables, leading to a multiscale graphical model. Unlike (Yao and Allen 2019), who introduced convex clustering in GGMs, our approach is expected to produce sparse estimations, thanks to an additional \ell_1 penalty.
The remainder of this paper is organized as follows. In section Multiscale Graphical Lasso, we formally introduce the Multiscale Graphical Lasso based on a convex estimation problem and an optimization algorithm based on the continuation of Nesterov’s smoothing technique (Hadj-Selem et al. 2018). Section Simulation experiments presents numerical results on simulated and real data.
2 Multiscale Graphical Lasso
The proposed method aims at inferring a graphical Gaussian model while hierarchically grouping variables. It infers conditional independence between different groups of variables. The approach is based on neighborhood selection (Meinshausen and Bühlmann 2006) and considers an additional fused-Lasso type penalty for clustering. In the spirit of hierarchical convex clustering, the hierarchical structure is recovered by spanning the regularization path.
Let X = (X^1, \dots, X^p)^T \in \mathbb R^p be a p-dimensional Gaussian random vector with mean vector \mu and covariance matrix \mathbf \Sigma. The conditional independence structure of X is characterized by a graph G = (V, E), where V = \{1,\ldots, p\} is the set of variables and E the set of edges, uniquely determined by the support of the precision matrix \mathbf{\Omega} = \mathbf{\Sigma}^{-1} (see, e.g., Dempster (1972)). In other words, for any two vertices i,j\in V, the edge (i,j) belongs to the set E if and only if \Omega_{ij} \neq 0, that is, if and only if the i-th and j-th variables are conditionally dependent given all the others, i.e. X^i \not\!\perp \!\!\! \perp X^j |X^{\setminus (i, j)}, where X^{\setminus (i, j)} denotes the set of the p variables deprived of variables i and j.
Considering the linear model X^i = \sum_{j\neq i} \beta^i_j X^j + \epsilon_i, where \epsilon_i is a centered Gaussian random variable, we have \beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}. We define the regression matrix \boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}, whose rows are the regression vectors of each of the p regressions.
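This identity between regression coefficients and precision entries can be checked numerically at the population level; a small illustration with a random positive-definite precision matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 6
# Random symmetric positive-definite precision matrix
M = rng.standard_normal((p, p))
Omega = M @ M.T + p * np.eye(p)
Sigma = np.linalg.inv(Omega)

i = 2
others = [j for j in range(p) if j != i]
# Population regression coefficients of X^i on the remaining variables:
# beta^i = Sigma_{\i,\i}^{-1} Sigma_{\i,i}
beta_i = np.linalg.solve(Sigma[np.ix_(others, others)], Sigma[others, i])
# Identity from the text: beta^i_j = -Omega_{ij} / Omega_{ii}
assert np.allclose(beta_i, -Omega[others, i] / Omega[i, i])
```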
Let the n \times p-dimensional matrix \mathbf{X} contain n independent observations of X. We propose to minimize the following criterion, which combines Lasso and group-fused Lasso penalties:

\min_{\boldsymbol{\beta} \in \mathbb{R}^{p \times (p-1)}} \; \frac{1}{2} \sum_{i=1}^p \left\lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right\rVert_2^2 + \lambda_1 \sum_{i=1}^p \left\lVert \boldsymbol{\beta}^i \right\rVert_1 + \lambda_2 \sum_{i < j} \left\lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right\rVert_2,
where \tau_{ij} is a permutation exchanging the coefficients \boldsymbol{\beta}^j_j and \boldsymbol{\beta}^j_i while leaving the other coefficients untouched, \mathbf{X}^{i}\in \mathbb{R}^n denotes the i-th column of \mathbf{X}, \boldsymbol{\beta}_{i} denotes the i-th row of \boldsymbol{\beta}, and \lambda_1 and \lambda_2 are non-negative penalization parameters.
The Lasso penalty term encourages sparsity, and the penalty term \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2 encourages the fusion of the regression vectors \boldsymbol{\beta}^i and \boldsymbol{\beta}^j. These fusions uncover a clustering structure: the model tends to cluster together variables that have the same conditional effects on the others. Variables X^i and X^j are assigned to the same cluster when \boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).
Let us illustrate the effect of the proposed approach with an example. Consider a group of q variables whose regression vectors have at least q non-zero coefficients, and assume that for each pair of group variables i and j, \|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0. After some permutations, we obtain a q\times q block of non-zero coefficients \beta_{ij}, (i,j)\in \{1,\cdots, q\}^2, corresponding to the group in the \boldsymbol{\beta} matrix. If we consider three distinct indices (i,j,k) \in \{1,\cdots, q\}^3, it is straightforward to show that the six coefficients indexed by (i,j,k) are all equal. Thus the distance constraints between the vectors \boldsymbol{\beta}^i of a group force the equality of all regression coefficients within the group: for a given value of \lambda_2, the algorithm tends to produce precision matrices with blocks of constant entries, each block corresponding to a cluster.
The greater the regularization weight \lambda_2, the larger the groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by Hocking et al. (2011). Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying \lambda_2 while \lambda_1 is fixed. The addition of a fused-type term in graphical model inference has been studied previously by authors such as Honorio et al. (2009), Ganguly and Polonik (2014), Grechkin et al. (2015). However, these existing methods require prior knowledge of the neighborhood of each variable. In contrast, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.
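Once \boldsymbol{\beta} has been estimated for a given (\lambda_1, \lambda_2), clusters can be read off by merging every pair whose fused distance is (numerically) zero and taking connected components. A minimal numpy sketch, with helper names of our own and \boldsymbol{\beta} stored for convenience as a p \times p matrix B with a zero diagonal:

```python
import numpy as np

def fused_distance(B, i, j):
    """||beta^i - tau_ij(beta^j)||_2, where B is p x p with B[k, k] = 0.
    tau_ij aligns the coordinates of beta^j with those of beta^i by
    exchanging the roles of positions i and j."""
    p = B.shape[0]
    mask = [k for k in range(p) if k not in (i, j)]
    d2 = np.sum((B[i, mask] - B[j, mask]) ** 2)
    d2 += (B[i, j] - B[j, i]) ** 2  # swapped pair of coefficients
    return np.sqrt(d2)

def clusters_from_beta(B, eps):
    """Merge i and j whenever d(i, j) <= eps, then take connected components
    (union-find)."""
    p = B.shape[0]
    parent = list(range(p))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(p):
        for j in range(i + 1, p):
            if fused_distance(B, i, j) <= eps:
                parent[find(i)] = find(j)
    return [find(i) for i in range(p)]

# Toy example: variables 0 and 1 have fused coefficients, variable 2 differs
B = np.array([[0.0, 0.5, 0.2],
              [0.5, 0.0, 0.2],
              [0.9, 0.9, 0.0]])
labels = clusters_from_beta(B, eps=1e-8)
```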
In practice, if some information about the clustering structure is available, the problem can be generalized to:

\min_{\boldsymbol{\beta} \in \mathbb{R}^{p \times (p-1)}} \; \frac{1}{2} \sum_{i=1}^p \left\lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right\rVert_2^2 + \lambda_1 \sum_{i=1}^p \left\lVert \boldsymbol{\beta}^i \right\rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left\lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right\rVert_2,
where w_{ij} is a positive weight encoding prior knowledge about the groups to which variables i and j belong. A good choice of these pairwise affinities can save computation time and improve clustering performance. The choice can be guided by the application; however, improper specification can give rise to clustering structures that are difficult to interpret (Chakraborty and Xu 2020). In the remainder of the paper, we assume that w_{ij} = 1 for simplicity.
3 Numerical scheme
This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section Optimization via CONESTA algorithm reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), the optimization algorithm used in practice to solve MGLasso. Section Reformulation of MGLasso for CONESTA algorithm details a reformulation of MGLasso, which enables us to apply CONESTA. Finally, Section Model selection presents the procedure used to select the regularization parameters.
3.1 Optimization via CONESTA algorithm
The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-Lasso type penalty, which is non-separable and admits no closed-form proximal operator. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, Hadj-Selem et al. (2018)), dedicated to high-dimensional regression problems with structured sparsity such as group structures.
The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems that includes group-wise penalties. The algorithm minimizes criteria of the form f(\boldsymbol{\theta} ) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}),
where \lambda_1 and \lambda_2 are penalty weights and \boldsymbol{\theta}\in \mathbb{R}^d is a d-dimensional vector of parameters to optimize.
In the original paper (Hadj-Selem et al. 2018), g(\boldsymbol{\theta}) is a differentiable function (the sum of a least-squares criterion and a ridge penalty), and h(\boldsymbol{\theta}) is a penalty function whose proximal operator \operatorname{prox}_{\lambda_1 h} is known in closed form.
Given \phi \subseteq \{1,\ldots, d\}, let \boldsymbol{\theta}_\phi = (\theta_i)_{i \in \phi} denote the subvector of \boldsymbol{\theta} indexed by \phi, and let \Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\} be a collection of such subsets of indices. For each \phi \in \Phi, the matrix \mathbf{A}_\phi \in \mathbb{R}^{m \times \operatorname{Card}(\phi) } defines a linear map from \mathbb{R}^{\operatorname{Card}(\phi)} to \mathbb{R}^m, sending \boldsymbol{\theta}_\phi to \mathbf{A}_\phi \boldsymbol{\theta}_\phi. The function s(\boldsymbol{\theta}) is assumed to be an \ell_{1,2}-norm, i.e., the sum of the group-wise \ell_2-norms of the elements \mathbf{A}_\phi \boldsymbol{\theta}_\phi, \phi \in \Phi. Namely,

s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \left\lVert \mathbf{A}_\phi \boldsymbol{\theta}_\phi \right\rVert_2.
When \mathbf{A}_\phi is the identity operator, such that \mathbf{A}_\phi \boldsymbol{\theta}_\phi = \boldsymbol{\theta}_\phi, the penalty function s is the (overlapping) group-Lasso and m = \operatorname{Card}(\phi). When \mathbf{A}_\phi is a discrete derivative operator, s is a total-variation penalty and m can be seen as the number of neighborhood relationships.
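A toy numerical illustration of these two special cases, with identity blocks on one hand and a discrete derivative operator on the other (the vector and groups are arbitrary):

```python
import numpy as np

theta = np.array([1.0, 1.0, 3.0, 3.0])

# Identity operator on groups {0,1} and {2,3}: s is a group-Lasso penalty
groups = [[0, 1], [2, 3]]
s_group = sum(np.linalg.norm(theta[g]) for g in groups)  # = 4 * sqrt(2)

# Discrete derivative operator (groups of size 1): s is a total-variation penalty
D = np.array([[-1, 1, 0, 0],
              [0, -1, 1, 0],
              [0, 0, -1, 1]])
s_tv = np.sum(np.abs(D @ theta))  # = |0| + |2| + |0| = 2
```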
The main ingredient of CONESTA is the approximation of the non-smooth \ell_{1,2}-norm penalty, whose proximal operator is unknown, by a smooth function with a known gradient, obtained via Nesterov’s smoothing (Nesterov 2005). Given a smoothness parameter \mu>0, let us define the smooth approximation
s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{A} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \},
where \mathcal{K} is the Cartesian product of \ell_2-unit balls, \mathbf{A} is the vertical concatenation of the matrices \mathbf{A}_\phi, and \boldsymbol{\alpha} is an auxiliary variable resulting from the dual reformulation of s(\boldsymbol{\theta}). Note that \lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}). A Fast Iterative Shrinkage-Thresholding Algorithm (FISTA, Beck and Teboulle (2009)) step can then be applied after computing the gradient of the smooth part g(\boldsymbol{\theta}) + \lambda_2 s_{\mu}(\boldsymbol{\theta}) of the approximated criterion.
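The maximization over \mathcal{K} has a simple closed form: per group, the optimal dual variable is the projection of \mathbf{A}_\phi \boldsymbol{\theta} / \mu onto the \ell_2-unit ball, and \nabla s_\mu = \mathbf{A}^T \boldsymbol{\alpha}^\star. A numpy sketch of this computation (the function name `smoothed_l12` is ours):

```python
import numpy as np

def smoothed_l12(theta, A_blocks, mu):
    """Nesterov smoothing of s(theta) = sum_phi ||A_phi theta||_2.
    Returns s_mu(theta) and its gradient A^T alpha*."""
    s_mu = 0.0
    grad = np.zeros_like(theta)
    for A in A_blocks:
        z = A @ theta
        nz = np.linalg.norm(z)
        if nz <= mu:
            alpha = z / mu      # unconstrained maximizer inside the unit ball
            s_mu += nz**2 / (2 * mu)  # quadratic (Huber-like) regime
        else:
            alpha = z / nz      # projection onto the unit ball
            s_mu += nz - mu / 2
        grad += A.T @ alpha     # gradient contribution of this group
    return s_mu, grad

theta = np.array([1.0, -2.0, 0.5])
A_blocks = [np.eye(3), np.array([[1.0, -1.0, 0.0]])]
s_exact = sum(np.linalg.norm(A @ theta) for A in A_blocks)
for mu in [1.0, 0.1, 0.001]:
    s_mu, _ = smoothed_l12(theta, A_blocks, mu)
    assert s_mu <= s_exact  # s_mu underestimates s and converges as mu -> 0
```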
A second key ingredient of CONESTA is the determination of the optimal smoothness parameter using the duality gap, which minimizes the number of FISTA iterations for a given precision \epsilon. The parameter \mu is updated dynamically: a sequence of decreasing optimal smoothness parameters is generated, namely \mu^k = \mu_{opt}(\epsilon^k), in order to adapt the FISTA step size as the iterates approach the target precision \epsilon. The smoothness parameter decreases as one gets closer to \boldsymbol{\theta} ^\star, the solution of the problem defined in Equation 4. Since \boldsymbol{\theta} ^\star is unknown, the distance to the optimum is approximated via the duality gap, which provides a computable upper bound on the current error.
We refer the reader to the seminal paper for more details on the formulation of \operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k). The CONESTA routine is spelled out in the algorithm CONESTA solver, where L(g + \lambda_2 s_{\mu}) is the Lipschitz constant of \nabla(g + \lambda_2 s_{\mu}). At each iteration of the CONESTA algorithm, the smoothing parameter \mu is updated dynamically using the duality gap, and a new smooth approximation is computed. The CONESTA algorithm enjoys a linear convergence rate and was shown empirically to outperform other computational options for structured-sparsity problems, such as ADMM and inexact FISTA, in terms of convergence speed (Hadj-Selem et al. 2018).
3.2 Reformulation of MGLasso for CONESTA algorithm
To apply CONESTA, it is necessary to reformulate the MGLasso problem so that it complies with the form of loss function required by CONESTA. The objective of MGLasso can indeed be written as

\frac{1}{2} \left\lVert \mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}} \right\rVert_2^2 + \lambda_1 \left\lVert \tilde{\boldsymbol{\beta}} \right\rVert_1 + \lambda_2 \sum_{i < j} \left\lVert \boldsymbol{A}_{ij} \tilde{\boldsymbol{\beta}} \right\rVert_2,
where \mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}, \tilde{\boldsymbol{\beta}} = \operatorname{Vec}(\boldsymbol{\beta}) \in \mathbb{R}^{p (p-1)}, and \tilde{\mathbf{X}} is an np \times p(p-1) block-diagonal matrix with \mathbf{X}^{\setminus i} on the i-th block. The matrix \boldsymbol A_{ij} is a (p-1)\times p(p-1) matrix chosen so that \boldsymbol A_{ij} \tilde{\boldsymbol{\beta}} = \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j).
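The equivalence between the stacked least-squares term and the p nodewise terms can be checked numerically; a sketch with numpy/scipy (variable names are ours):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)
n, p = 10, 4
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, p - 1))  # row i = beta^i, regression of X^i on X^{\i}

# Stacked response Y = Vec(X) and block-diagonal design X_tilde,
# with X^{\i} (X without column i) on the i-th diagonal block
Y = X.T.reshape(-1)                      # columns of X stacked
blocks = [np.delete(X, i, axis=1) for i in range(p)]
X_tilde = block_diag(*blocks)            # (n p) x (p (p-1))
beta_tilde = B.reshape(-1)               # Vec(beta), rows stacked

# The single least-squares term equals the sum of the p nodewise terms
lhs = np.sum((Y - X_tilde @ beta_tilde) ** 2)
rhs = sum(np.sum((X[:, i] - blocks[i] @ B[i]) ** 2) for i in range(p))
assert np.isclose(lhs, rhs)
```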
Note that we introduce this notation for simplicity of exposition; in practice, the sparsity of the matrices \boldsymbol A_{ij} allows a more efficient implementation. Based on the reformulation in Equation 5, we may apply CONESTA to solve the MGLasso objective for fixed \lambda_1 and \lambda_2. The procedure is applied, for fixed \lambda_1, to a range of increasing values of \lambda_2 to obtain a hierarchical clustering. The corresponding pseudo-code is given in the following algorithm, where (\mathbf{X}^{\setminus i})^{\dagger} denotes the pseudo-inverse of \mathbf{X}^{\setminus i} and \epsilon_{fuse} is the threshold for merging clusters.
\begin{algorithm}
\caption{MGLasso algorithm}
\begin{algorithmic}
\STATE \textbf{Inputs}: \\
$\quad$ Set of variables $\mathbf{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^p \} \in \mathbb R^{n\times p}$ \\
$\quad$ Penalty parameters $\lambda_1 \ge 0, {\lambda_2}_{\operatorname{initial}} > 0$ \\
$\quad$ Increasing factor $\eta > 1$ for fusion penalties $\lambda_2$\\
$\quad$ Fusion threshold $\epsilon_{fuse} \ge 0$
\STATE \textbf{Outputs:} For $\lambda_1$ fixed and $\lambda_2$ from $0$ to ${\lambda_2}_{\operatorname{initial}} \times \eta^{(I)}$ with $I$ the number of iterations: \\
$\quad$ Regression vectors $\boldsymbol{\beta}(\lambda_1, \lambda_2) \in \mathbb R^{p \times (p-1)}$, \\
$\quad$ Clusters of variables indices $C(\lambda_1, \lambda_2) =\{ C_k \}_{(\lambda_1, \lambda_2)}, k=1, \dots, K$
\STATE \textbf{Initializations:} \\
$\quad$ $\boldsymbol{\beta}^i = (\mathbf{X}^{\setminus i})^{\dagger}\mathbf{X}^i$, $\forall i = 1, \dots, p$ for warm start in CONESTA solver \\
$\quad$ $C = \left \{\{1\}, \dots, \{p\}\right \}$ Initial clusters with one element per cluster. \\
$\quad$ Set $\lambda_2 = 0$ \\
$\quad$ Compute $\boldsymbol{\beta}$ using CONESTA solver (Equivalent to Meinshausen-Bühlmann approach) \\
$\quad$ Update clusters $C$ with rule described in \textbf{while} loop.
\STATE \textbf{Set:} $\lambda_2 = {\lambda_2}_{\operatorname{initial}}$ \\
\COMMENT{Clustering path}
\WHILE{$\operatorname{Card}(C) > 1$}
\STATE Compute $\boldsymbol{\beta}$ using CONESTA solver with warm start from previous iteration \\
\COMMENT{Clusters update}
\STATE Compute pairwise distances $d(i,j)=\left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2$, $\forall i,j \in \{1, \dots, p\}$ \\
\STATE Determine clusters $C_k (k=1, \dots, K)$ with the rule $(i,j) \in C_k$ iff. $d(i,j) \le \epsilon_{fuse}$
\STATE $\lambda_2 = \lambda_2 \times \eta$
\ENDWHILE
\end{algorithmic}
\end{algorithm}
3.3 Model selection
A crucial question for practical applications is the definition of a rule to select the penalty parameters (\lambda_1, \lambda_2). This selection problem operates at two levels: \lambda_1 controls the sparsity of the graphical model, and \lambda_2 controls the number of clusters in the optimal clustering partition. These two parameters are dealt with separately: the sparsity parameter \lambda_1 is chosen via model selection, while the clustering parameter \lambda_2 varies across a grid of values, in order to obtain graphs with different levels of granularity. The problem of model selection in graphical models is difficult in the high dimensional case where the number of samples is small compared to the number of variables, as classical AIC and BIC criteria tend to perform poorly (Liu, Roeder, and Wasserman 2010). Alternative criteria have been proposed in the literature, such as cross-validation (Bien and Tibshirani 2011), GGMSelect (Giraud, Huet, and Verzelen 2012), stability selection (Meinshausen and Bühlmann 2010; Liu, Roeder, and Wasserman 2010), Extended Bayesian Information Criterion (EBIC) (Foygel and Drton 2010), and Rotation Information Criterion (Zhao et al. 2012).
In this paper, we focus on the StARS stability selection approach proposed by Liu, Roeder, and Wasserman (2010). The method uses subsamples of the data to estimate the associated graphs over a given range of \lambda_1 values; for each value, a global instability of the graph edges is computed, and the optimal value of \lambda_1 is chosen so as to control this instability, as follows. Let \lambda^{(1)}_1, \dots, \lambda_1^{(K)} be a grid of sparsity regularization parameters, and S_1, \dots, S_N be N subsamples obtained by sampling the rows of the data set \mathbf{X}. For each k\in\{1,\ldots,K\} and each j\in\{1,\ldots, N\}, we denote by \mathcal{A}^{k,j}(\mathbf{X}) the adjacency matrix of the graph estimated by applying the inference algorithm to S_j with regularization parameter \lambda_1^{(k)}. For each possible edge (s,t)\in\{1,\ldots,p\}^2, the probability of edge appearance is estimated empirically by

\hat p_{st}(\lambda_1^{(k)}) = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}(\mathbf{X}),
and we define \hat \xi_{st}(\lambda_1^{(k)}) = 2 \hat p_{st}(\lambda_1^{(k)}) \left(1 - \hat p_{st}(\lambda_1^{(k)})\right), the empirical instability of edge (s,t) (that is, twice the variance of the Bernoulli indicator of edge (s,t)). The total instability level associated with \lambda_1^{(k)} is given by
\hat D(\lambda_1^{(k)}) = \frac{1}{\binom{p}{2}} \sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)}).
StARS selects the optimal penalty parameter as the smallest amount of regularization for which the (monotonized) instability stays below a threshold:

\hat \lambda_1 = \inf \left\{ \lambda_1^{(k)},\ k = 1, \ldots, K \ :\ \bar D(\lambda_1^{(k)}) \le \beta \right\}, \quad \text{with } \bar D(\lambda_1^{(k)}) = \max_{k' \ge k} \hat D(\lambda_1^{(k')}) \text{ for an increasing grid},
where \beta is the threshold chosen for the instability level.
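As an illustration, the edge-appearance probabilities and the resulting instability for one value of \lambda_1 can be computed directly from the N estimated adjacency matrices; a sketch (the helper name `stars_instability` is ours):

```python
import numpy as np

def stars_instability(adjacencies):
    """Total edge instability from N estimated adjacency matrices
    (one per subsample) at a fixed lambda_1."""
    A = np.asarray(adjacencies, dtype=float)  # shape (N, p, p)
    p_hat = A.mean(axis=0)                    # empirical edge probabilities
    xi = 2 * p_hat * (1 - p_hat)              # twice the Bernoulli variance
    p = p_hat.shape[0]
    iu = np.triu_indices(p, k=1)
    return xi[iu].sum() / (p * (p - 1) / 2)   # average over the C(p,2) edges

# Toy check: an edge present in half the subsamples is maximally unstable
A1 = np.zeros((3, 3)); A2 = np.zeros((3, 3))
A1[0, 1] = A1[1, 0] = 1
D_hat = stars_instability([A1, A2])  # xi_{01} = 0.5, averaged over 3 edges
```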
4 Simulation experiments
In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the ground truth for the MGLasso and GLasso methods in the Erdös-Renyi, Scale-free, and Stochastic Block Model frameworks. Adjusted Rand indices are used to compare the partitions obtained with MGLasso, hierarchical agglomerative clustering, and K-means clustering in a stochastic block model framework.
4.1 Synthetic data models
We consider three different synthetic network models: the Stochastic Block Model (SBM, Fienberg and Wasserman (1981)), the Erdös-Renyi model (Erdős, Rényi, et al. 1960) and the Scale-Free model (Newman, Strogatz, and Watts 2001). In each case, Gaussian data is generated by drawing n independent realizations of a multivariate Gaussian distribution \mathcal N(0, \mathbf{\Sigma}) where \mathbf{\Sigma} \in \mathbb{R}^{p \times p} and \mathbf{\Omega} = \mathbf{\Sigma} ^{-1}. The support of \mathbf{\Omega}, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty level of the problem is controlled by varying the ratio \frac{n}{p} with p fixed at 40: \frac{n}{p}\in \{0.5,1,2\}.
4.1.1 Stochastic Block-Model
We construct a block-diagonal precision matrix \mathbf{\Omega} as follows. First, we generate the support of \mathbf{\Omega}, denoted by \boldsymbol A\in\{0,1\}^{p\times p}, as shown in Figure 1. To do this, the variables are first partitioned into K = 5 hidden groups, denoted C_1, \dots, C_K and described by latent random variables Z_i such that Z_i = k if i \in C_k. Each Z_i follows a multinomial distribution
P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\}, where \pi = (\pi_1, \dots, \pi_K) is the vector of cluster proportions, which sums to one. The set of latent variables is denoted \mathbf{Z} = \{ Z_1, \dots, Z_p\}. Conditionally on \mathbf{Z}, A_{ij} follows a Bernoulli distribution such that

P(A_{ij} = 1 \mid Z_i = k, Z_j = l) = \alpha_{kl},
where \alpha_{kl} is the probability of connectivity between clusters k and l, with \alpha_{kl} = 0.01 if k\neq l and \alpha_{ll} = 0.75. For k\in\{1,\ldots, K\}, we define the cluster size p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}. The precision matrix \mathbf{\Omega} of the graph is then calculated as follows. We set \Omega_{ij} = 0 if Z_i\neq Z_j; otherwise, we set \Omega_{ij} = A_{ij}\omega_{ij} where, for all i\in\{1,\ldots,p\} and all j such that Z_j = Z_i = k, \omega_{ij} is given by:

\omega_{ij} = \begin{cases} \dfrac{1+\rho(p_k-2)}{1+\rho(p_k-2)-\rho^2(p_k-1)} & \text{if } i = j, \\[2ex] \dfrac{-\rho}{1+\rho(p_k-2)-\rho^2(p_k-1)} & \text{if } i \neq j. \end{cases}
If \alpha_{ll} were equal to one, this construction of \mathbf{\Omega} would make it possible to set the level of correlation between the variables within each block exactly to \rho. The more realistic scheme with \alpha_{ll}=0.75 only allows approximate control.
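A minimal numpy check of this claim (an illustration, separate from the R code below): with a fully connected block, the precision entries above are exactly the inverse of the equicorrelation matrix, so the within-block correlation is exactly \rho.

```python
import numpy as np

def block_precision(q, rho):
    """q x q within-block precision with the omega_ij entries from the text:
    the inverse of the equicorrelation matrix (1 - rho) I + rho J."""
    denom = 1 + rho * (q - 2) - rho**2 * (q - 1)
    b = -rho / denom                      # off-diagonal entries
    d = (1 + rho * (q - 2)) / denom       # diagonal entries
    return b * np.ones((q, q)) + (d - b) * np.eye(q)

q, rho = 5, 0.2
Omega = block_precision(q, rho)
Sigma = np.linalg.inv(Omega)
# Inside a fully connected block, the correlation is exactly rho
assert np.allclose(np.diag(Sigma), 1.0)
assert np.allclose(Sigma[~np.eye(q, dtype=bool)], rho)
```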
```r
bloc_diag <- function(n_vars, connectivity_mat, prop_clusters, rho) {
  true_nclusters <- 0
  while (true_nclusters != length(prop_clusters)) { ## To make sure we have the required number of clusters
    network <- simone::rNetwork(n_vars, connectivity_mat, prop_clusters)
    true_nclusters <- length(unique(network$clusters))
  }
  precision_mat <- network$A[order(network$clusters), order(network$clusters)]
  eff_clusters <- table(network$clusters)
  b <- -rho / (1 + rho * (eff_clusters - 2) - rho^2 * (eff_clusters - 1))
  d <- (1 + rho * (eff_clusters - 2)) / (1 + rho * (eff_clusters - 2) - rho^2 * (eff_clusters - 1))
  for (i in 1:length(eff_clusters)) {
    temp <- precision_mat[which(row.names(precision_mat) == i),
                          which(row.names(precision_mat) == i)]
    temp <- as.matrix(temp)
    temp[temp != 0] <- b[i]
    diag(temp) <- d[i]
    precision_mat[which(row.names(precision_mat) == i),
                  which(row.names(precision_mat) == i)] <- temp
  }
  flag <- min(svd(precision_mat)$d)
  if (flag < 0) {
    diag(precision_mat) <- 1 - flag
  }
  return(precision_mat)
}

#' simulate data with given graph structure
sim_data <- function(p = 20, np_ratio = 2,
                     structure = c("block_diagonal", "hub", "scale_free", "erdos"),
                     alpha, prob_mat, rho, g,
                     inter_cluster_edge_prob = 0.01, p_erdos = 0.1, verbose = FALSE) {
  structure <- match.arg(structure)
  n <- round(np_ratio * p)
  switch(structure,
    erdos = {
      network <- simone::rNetwork(p, p_erdos, 1)
      correlation <- solve(network$Theta)
      X <- mvtnorm::rmvnorm(n, sigma = correlation)
      graph <- network$Theta
      if (verbose) message("Data, precision matrix and graph generated from Erdos structure.")
    },
    hub = {
      L <- huge::huge.generator(graph = "hub", g = 5, d = p, n = n, verbose = FALSE)
      X <- L$data
      graph <- L$omega
      correlation <- L$sigma
      if (verbose) message("Data, precision matrix and graph generated from Hub structure.")
    },
    scale_free = {
      L <- huge::huge.generator(graph = "scale-free", d = p, n = n, verbose = FALSE) ## d edges graph
      X <- L$data
      graph <- L$omega
      correlation <- L$sigma
      if (verbose) message("Data, precision matrix and graph generated from Scale free structure.")
    },
    block_diagonal = {
      if (inter_cluster_edge_prob != 0) {
        # inter-clusters edges added in the flipped way
        flag <- TRUE
        while (flag) {
          K <- bloc_diag(p, prob_mat, alpha, rho)
          target_indices <- which(K[upper.tri(K)] == 0)
          # Random selection of edges to be set to 1
          select_len <- round(length(target_indices) * inter_cluster_edge_prob)
          selected_indices <- sample(target_indices, select_len)
          precision_level <- unique(K[upper.tri(K)])
          precision_level <- max(precision_level[precision_level != 0])
          K[upper.tri(K)][selected_indices] <- precision_level
          K <- as.matrix(Matrix::forceSymmetric(K, uplo = "U"))
          flag <- any(eigen(K)$values <= 0) # Control of positive definiteness
        }
        correlation <- solve(K)
        graph <- K
        X <- mvtnorm::rmvnorm(n, sigma = correlation)
        if (verbose) message("Data, precision matrix and graph generated from block-diagonal structure with inter-clusters edges.")
      } else {
        # Only intra-cluster edges while approximately controlling correlation level
        K <- bloc_diag(p, prob_mat, alpha, rho)
        correlation <- solve(K)
        graph <- K
        X <- mvtnorm::rmvnorm(n, sigma = correlation)
        if (verbose) message("Data, precision matrix and graph generated from block-diagonal structure with only intra-clusters edges.")
      }
    }
  )
  return(list(X = X, graph = graph, correlation = correlation))
}

adj_mat <- function(mat) {
  diag(mat) <- 0
  mat[abs(mat) < 1e-10] <- 0
  mat[mat != 0] <- 1
  return(mat)
}

set.seed(2020)
sim_sbm <- sim_data(p = 40, structure = "block_diagonal", alpha = rep(1/5, 5),
                    prob_mat = diag(0.75, 5), rho = 0.2,
                    inter_cluster_edge_prob = 0.01)
gsbm <- adj_mat(sim_sbm$graph)
Matrix::image(as(gsbm, "sparseMatrix"), sub = "", xlab = "", ylab = "")
```
Figure 1: Adjacency matrix of a stochastic block model with 5 blocks.
4.1.2 Erdös-Renyi Model
The Erdös-Renyi model is a special case of the stochastic block model where \alpha_{kl} = \alpha_{ll} = \alpha is constant. We set the density \alpha of the graph to 0.1; see Figure 2 for an example of the graph resulting from this model.
Figure 2: Adjacency matrix of an Erdös-Renyi model.
4.1.3 Scale-free Model
The Scale-free Model generates networks whose degree distribution follows a power law. The graph starts from an initial chain of 2 nodes; new nodes are then added to the graph one by one, each connecting to an existing node with probability proportional to that node's degree. We set the number of edges in the graph to 40. An example of a scale-free graph is shown in Figure 3.
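The preferential-attachment mechanism just described can be sketched as follows (a simplified illustration; the experiments use `huge::huge.generator`, and this sketch grows a tree with p - 1 edges rather than exactly 40):

```python
import numpy as np

def scale_free_graph(p, seed=0):
    """Preferential attachment starting from a 2-node chain: each new node
    attaches to one existing node with probability proportional to its degree."""
    rng = np.random.default_rng(seed)
    A = np.zeros((p, p), dtype=int)
    A[0, 1] = A[1, 0] = 1                     # initial chain of 2 nodes
    for new in range(2, p):
        deg = A[:new, :new].sum(axis=1)       # current degrees
        target = rng.choice(new, p=deg / deg.sum())
        A[new, target] = A[target, new] = 1
    return A

A = scale_free_graph(40)
deg = A.sum(axis=1)
```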
We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter \lambda_1; however, MGLasso admits an additional regularization parameter, \lambda_2, which controls the strength of convex clustering. To compare the two methods, in each ROC curve, we vary the parameter \lambda_1 while the parameter \lambda_2 (for MGLasso) is kept constant. We computed ROC curves for 4 different penalty levels for the \lambda_2 parameter; since GLasso does not depend on \lambda_2, the GLasso ROC curves are replicated.
For a decision rule associated with a sparsity penalty level \lambda_1, we recall the definitions of the following two functions. The sensitivity, also called the true positive rate or recall, is given by: \begin{align*}
\lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}.
\end{align*} Specificity, also called true negative rate or selectivity, is defined as follows: \begin{align*}
\lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}.
\end{align*} The ROC curve, parameterized by \lambda_1, represents \text{sensitivity}(\lambda_1) as a function of 1 - \text{specificity}(\lambda_1), which is the false positive rate.
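Both quantities can be computed directly from the estimated and true adjacency matrices. The helper below is an illustrative sketch (not part of the article's code): since the graphs are undirected, the TP/FP/TN/FN counts are taken over the upper triangle only.

```r
# Sensitivity and specificity of an estimated graph against the true
# graph, computed on the off-diagonal upper triangle of the adjacencies.
roc_point <- function(A_hat, A_true) {
  est <- A_hat[upper.tri(A_hat)] != 0
  tru <- A_true[upper.tri(A_true)] != 0
  TP <- sum(est & tru);   FN <- sum(!est & tru)
  TN <- sum(!est & !tru); FP <- sum(est & !tru)
  c(sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}

# Tiny 3-node example: the true graph is the chain 1-2-3.
A_true <- matrix(c(0, 1, 0, 1, 0, 1, 0, 1, 0), 3, 3)
A_hat  <- matrix(c(0, 1, 1, 1, 0, 0, 1, 0, 0), 3, 3)
roc_point(A_hat, A_true) # sensitivity = 0.5, specificity = 0
```

Varying \lambda_1 moves this (1 - specificity, sensitivity) point along the ROC curve.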
For each configuration (n, p fixed), we generate 50 replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in Figures 4, 5 and 6, for \frac{n}{p} \in \{0.5, 1, 2\}.
# The code used to generate the following results is based on R and python libraries/functions.
# The reticulate package allows running python code from RStudio.
# First set up the python dependencies before running the code.
library(reticulate)
config <- reticulate::py_config() # Initialize the python engine
# It is possible to create an isolated virtual or conda environment in which the code will be
# compiled, by calling reticulate::virtualenv_create() or reticulate::conda_create(), and then
# activate the environment with reticulate::use_condaenv() or reticulate::use_virtualenv().
system2(config$python, c("-m", "pip", "install",
                         shQuote("git+https://github.com/neurospin/pylearn-parsimony.git")))
# pylearn-parsimony requires scipy version 1.7.1. More recent versions generate compilation errors.
reticulate::py_install(packages = c(
  "scipy == 1.7.1",
  "scikit-learn",
  "numpy == 1.22.4",
  "six", # If needed
  "matplotlib"
))
# Check that all required python dependencies are available.
testthat::expect_true(reticulate::py_module_available("numpy"))
testthat::expect_true(reticulate::py_module_available("scipy"))
testthat::expect_true(reticulate::py_module_available("six"))
testthat::expect_true(reticulate::py_module_available("matplotlib"))
testthat::expect_true("scikit-learn" %in% reticulate::py_list_packages()$package)
testthat::expect_true("pylearn-parsimony" %in% reticulate::py_list_packages()$package)
# It might be necessary to restart RStudio on some operating systems.
source("./rscripts/mglasso_functions/onload.R")
# Performances calculation -------------------------------------------------
# Launched on a cluster using 72 cores

## Settings -----------------------------------------------------------------
### Model --------------------------------------------------------------------
# NA values for some parameters mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)

### Simulation ---------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))

### Erdos --------------------------------------------------------------------
runtime_roc_config_p40_erdos01 <- system.time(
  roc_config_p40_erdos01 <- mclapply(
    list_ii_rho,
    FUN = one_simu_ROC,
    model = "erdos",
    mc.cores = no_cores
  )
)
save(roc_config_p40_erdos01,
     file = paste0(path_roc, "roc_config_p40_erdos01.RData"))
save(runtime_roc_config_p40_erdos01,
     file = paste0(path_roc, "runtime_roc_config_p40_erdos01.RData"))

## Erdos
load(paste0(path_roc, "roc_config_p40_erdos01.RData"))
dt_full <- roc_config_p40_erdos01

### Merge in one graph
# Three sample sizes are used and the vector c(20, 40, 80) is replicated 50 times.
# We subset the data frame in three parts corresponding to the relevant sample sizes.
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

# Compute the mean over the 50 ROC curves
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

# Restructure the list results in a matrix for plotting
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5 # add a ratio n over p variable
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_erdos <- rbind(roc_dt20, roc_dt40, roc_dt80)
Figure 6: ROC curves for the Stochastic Block model comparing MGLasso and GLasso methods.
Based on these empirical results, we first observe that, in all the considered simulation models, MGLasso improves over GLasso in terms of support recovery in the high-dimensional setting where n < p. In addition, in the absence of a total variation penalty, i.e., \lambda_2 = 0, MGLasso performs no worse than GLasso in each of the 3 models. However, for \lambda_2 > 0, increasing the penalty value does not seem to significantly improve the support recovery performance of MGLasso, as we observe similar results for \lambda_2 = 3.3, 6.6, 10. Preliminary analyses show that, as \lambda_2 increases, the estimates of the regression vectors are shrunk towards 0. This shrinkage effect of group-fused penalty terms was also observed in (Chu et al. 2021).
4.3 Clustering
To assess clustering performance, we compare the partitions estimated by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward’s distance, and K-means to the true partition in a stochastic block model framework. Euclidean distances between variables are used for HAC and K-means. The criterion used for the comparison is the adjusted Rand index. We study the influence of the correlation level inside clusters on clustering performance through two values of the correlation parameter, \rho \in \{0.25, 0.95\}; the vector of cluster proportions is fixed at \mathbf \pi = (1/5, \dots, 1/5). We then simulate 100 Gaussian data sets for each simulation configuration (\rho, n/p fixed). The optimal sparsity penalty for MGLasso is chosen by the Stability Approach to Regularization Selection (StARS) method (Liu, Roeder, and Wasserman 2010), and we vary the parameter \lambda_2.
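The adjusted Rand index can be computed with packages such as mclust (adjustedRandIndex()); the self-contained base-R sketch below makes the underlying formula explicit.

```r
# Adjusted Rand Index between two partitions: the raw pair-counting
# agreement, corrected for the agreement expected under chance.
adj_rand_index <- function(part1, part2) {
  tab <- table(part1, part2)           # contingency table of the two labelings
  comb2 <- function(x) x * (x - 1) / 2 # number of pairs "choose(x, 2)"
  sum_ij <- sum(comb2(tab))
  sum_a  <- sum(comb2(rowSums(tab)))
  sum_b  <- sum(comb2(colSums(tab)))
  expected <- sum_a * sum_b / comb2(sum(tab))
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

truth <- rep(1:2, each = 4)
adj_rand_index(truth, truth)                      # 1 for identical partitions
adj_rand_index(truth, c(2, 2, 2, 2, 1, 1, 1, 1))  # also 1: labels are arbitrary
```

The index equals 1 for identical partitions (up to label permutation) and is close to 0 for independent ones, which makes it suitable for comparing estimated and true partitions.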
# Launch simulations ---------------------------------------------------------
## Settings -------------------------------------------------------------------
### Model ----------------------------------------------------------------------
p <- 40
seq_n <- c(20, 40, 80)
alpha <- rep(1/5, 5)
seq_rho <- c(0.25, 0.95)
seq_dnsty <- c(0.75)
type <- NA # unused; to do: delete in configs_simu parameters
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)

### Simulation -----------------------------------------------------------------
n_simu <- 100
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
mc_cores <- min(80, length(list_ii_rho))
RNGkind("L'Ecuyer-CMRG")

## Test ------------------------------------------------------------------------
# For a quicker test:
# - set nl2 to 2 in one_simu_extended
# - set p = 9 and n = 10
test <- one_simu_extended(list_ii_rho$`1`, verbose = TRUE, model = "block_diagonal")

## Models ----------------------------------------------------------------------
# After the quicker test: reset nl2 to 20
### Stochastic Block Diagonal ----------------------------------------------------
runtime_rand50_config_p40_bdiagflip001_allcor <- system.time(
  rand50_config_p40_bdiagflip001_allcor <- mclapply(
    list_ii_rho,
    FUN = one_simu_extended,
    model = "block_diagonal",
    mc.cores = mc_cores
  )
)
save(runtime_rand50_config_p40_bdiagflip001_allcor,
     file = paste0(path_extended, "runtime_rand100_config_p40_bdiagflip001_allcor.RData"))
save(rand50_config_p40_bdiagflip001_allcor,
     file = paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
# Clustering
# Settings:
# - Inter-clusters edge probability 0.01 (flip on all the missing edges)

## Rand index
load(paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
dt <- rand100_config_p40_bdiagflip001_allcor

# Calculate cluster partitions, with thresh_fuse as the required difference
# threshold for merging two regression vectors
list_res <- lapply(dt, function(e) {
  get_perf_from_raw("rand", e, thresh_fuse = 1e-6)
})
dt_rand <- do.call(rbind, list_res)
save(dt_rand,
     file = paste0(path_extended, "dt_rand_p40_bdiagflip001_allcor_thresh_e6.RData"))

## Theoretical correlation set to 0.25
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.25, np_ = c(0.5, 1, 2))

## Theoretical correlation set to 0.95
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.95, np_ = c(0.5, 1, 2))

# The files `rand_dt_higher_cor_sbm.RData` and `rand_dt_lower_cor_sbm.RData` are
# obtained by splitting `dt_rand` according to theoretical correlation levels.
Figure 7: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a low correlation context.
Figure 8: Adjusted Rand Indices for the HAC, k-means and MGLasso methods. Performances are observed for 4 different number of clusters in a high correlation context.
The results shown in Figure 7 and Figure 8 demonstrate that, particularly at the lower to medium levels of the hierarchy (between 20 and 10 clusters), the hierarchical clustering structure uncovered by MGLasso is comparable to that of popular clustering methods used in practice. At the higher levels (5 clusters), the performance of MGLasso deteriorates. As expected, all three compared methods also deteriorate as the level of correlation inside clusters decreases.
5 Analysis of microbial associations data
We finally illustrate our new method for inferring the multiscale Gaussian graphical model with an application to the analysis of microbial associations in the American Gut Project. The data are counts that have been normalized by applying the centered log-ratio transform, as in (Kurtz et al. 2015). After some filtering steps (Kurtz et al. 2015) on the operational taxonomic unit (OTU) counts (removed if present in less than 37\% of the samples) and the samples (removed if sequencing depth is below 2700), the remaining top OTUs form a dataset of n = 289 samples and p = 127 OTUs. As a preliminary analysis, we perform a hierarchical agglomerative clustering (HAC) on the OTUs, which identifies four significant groups. The correlation matrix of the dataset is given in Figure 9; variables have been rearranged according to the HAC partition.
The average correlations within blocks of variables belonging to the same cluster are given below. We observe relatively high levels of correlation in small blocks, similar to the simulation models used to evaluate the performance of clustering in the Simulation Experiments section.
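This kind of within-block summary can be computed as follows. The sketch below uses simulated data for illustration; in the application, clusters would be the HAC partition of the OTUs and X the normalized count matrix.

```r
# Average absolute correlation within each cluster of variables.
within_block_cor <- function(X, clusters) {
  C <- cor(X)
  sapply(sort(unique(clusters)), function(k) {
    idx <- which(clusters == k)
    Ck <- C[idx, idx]
    mean(abs(Ck[upper.tri(Ck)])) # off-diagonal entries of the block only
  })
}

# Toy data: block 1 shares a common latent factor, block 2 is pure noise.
set.seed(2020)
X <- matrix(rnorm(100 * 8), 100, 8)
X[, 1:4] <- X[, 1:4] + rnorm(100) # the same factor added to each column of block 1
clusters <- rep(1:2, each = 4)
within_block_cor(X, clusters) # block 1 markedly higher than block 2
```

On the toy data, the shared factor drives the first block's average correlation toward 0.5, while the second block stays near 0.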
We apply MGLasso to the normalized counts to infer a graph and a clustering structure. The graphs obtained by MGLasso for \lambda_2 = 0 and for \lambda_2 = 5 (corresponding respectively to 127 and 80 clusters) are given below. In each case, the parameter \lambda_1 is chosen by stability selection (see Model Selection section).
Figure 10: Inferred graphs using MGLasso for different values of the total variation penalty.
The variables were reordered according to the clustering partition obtained from the distances between the regression vectors. Increasing \lambda_2 reduces the number of clusters and leads to a shrinkage effect on the estimates. The adjacency matrix of the neighborhood selection, equivalent to setting \lambda_2 to 0, is given in Figure 10 (top). In Figure 10 (bottom), the deduced partition is composed of 80 clusters. A confusion matrix comparing the edges selected by MGLasso with \lambda_2 = 5 and by neighborhood selection is given below. Adding the total variation penalty increases the fusion effect, resulting in a larger number of edges in the graph.
|  | Neighborhood selection: non-edges | Neighborhood selection: edges |
|---|---|---|
| MGLasso (\lambda_2 = 5): non-edges | 15678 | 0 |
| MGLasso (\lambda_2 = 5): edges | 288 | 163 |
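A confusion matrix of this kind can be built directly from two estimated adjacency matrices. The sketch below is illustrative; edge_confusion is a hypothetical helper, not part of the article's code.

```r
# Cross-tabulate the edges selected by two estimated graphs,
# counting over the upper triangle (undirected graphs, no self-loops).
edge_confusion <- function(A_row, A_col) {
  e_row <- factor(A_row[upper.tri(A_row)] != 0,
                  levels = c(FALSE, TRUE), labels = c("non-edges", "edges"))
  e_col <- factor(A_col[upper.tri(A_col)] != 0,
                  levels = c(FALSE, TRUE), labels = c("non-edges", "edges"))
  table(e_row, e_col)
}

# Tiny 3-node example with two different estimated graphs.
A1 <- matrix(c(0, 1, 0, 1, 0, 1, 0, 1, 0), 3, 3)
A2 <- matrix(c(0, 1, 1, 1, 0, 0, 1, 0, 0), 3, 3)
edge_confusion(A1, A2)
```

The total count equals the number of possible edges p(p - 1)/2, and the off-diagonal cells show where the two estimates disagree.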
6 Conclusion
We proposed a new technique, called MGLasso, that combines Gaussian Graphical Model inference and hierarchical clustering. The method proceeds via convex optimization and minimizes the neighborhood selection objective penalized by a hybrid regularization combining a sparsity-inducing norm and a convex clustering penalty. We developed a complete numerical scheme to apply MGLasso in practice, with an optimization algorithm based on CONESTA and a model selection procedure. Our simulation results on synthetic and real datasets showed that MGLasso can perform better than GLasso in network support recovery in the presence of groups of correlated variables, and we illustrated the method with the analysis of microbial associations data. The present work paves the way for future improvements: first, by incorporating prior knowledge through more flexible weighted regularization; second, by studying the theoretical properties of the method in terms of statistical guarantees for the MGLasso estimator. Moreover, the node-wise regression approach on which our method is based can be extended to a broader family of non-Gaussian distributions belonging to the exponential family, as outlined by Yang et al. (2012).
Ambroise, Christophe, Julien Chiquet, and Catherine Matias. 2009. “Inferring Sparse Gaussian Graphical Models with Latent Structure.” Electronic Journal of Statistics 3: 205–38. https://doi.org/10.1214/08-EJS314.
Banerjee, Onureena, Laurent El Ghaoui, and Alexandre d’Aspremont. 2008. “Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data.” Journal of Machine Learning Research 9 (June): 485–516.
Beck, Amir, and Marc Teboulle. 2009. “A Fast Iterative Shrinkage-Thresholding Algorithm for Linear Inverse Problems.” SIAM Journal on Imaging Sciences 2 (1): 183–202. https://doi.org/10.1137/080716542.
Bien, Jacob, and Robert J. Tibshirani. 2011. “Sparse Estimation of a Covariance Matrix.” Biometrika 98 (4): 807–20.
Bühlmann, Peter, Philipp Rütimann, Sara van de Geer, and Cun-Hui Zhang. 2012. “Correlated Variables in Regression: Clustering and Sparse Estimation.”
Cai, Tony, Weidong Liu, and Xi Luo. 2011. “A Constrained L1 Minimization Approach to Sparse Precision Matrix Estimation.” Journal of the American Statistical Association 106 (494): 594–607. https://doi.org/10.1198/jasa.2011.tm10155.
Chakraborty, Saptarshi, and Jason Xu. 2020. “Biconvex Clustering.” arXiv Preprint arXiv:2008.01760.
Cheng, Lulu, Liang Shan, and Inyoung Kim. 2017. “Multilevel Gaussian Graphical Model for Multilevel Networks.” Journal of Statistical Planning and Inference 190 (November): 1–14. https://doi.org/10.1016/j.jspi.2017.05.003.
Chu, Shuyu, Huijing Jiang, Zhengliang Xue, and Xinwei Deng. 2021. “Adaptive Convex Clustering of Generalized Linear Models with Application in Purchase Likelihood Prediction.” Technometrics 63 (2): 171–83.
Devijver, Emilie, and Mélina Gallopin. 2018. “Block-Diagonal Covariance Selection for High-Dimensional Gaussian Graphical Models.” Journal of the American Statistical Association 113 (521): 306–14. https://doi.org/10.1080/01621459.2016.1247002.
Erdős, Paul, and Alfréd Rényi. 1960. “On the Evolution of Random Graphs.” Publ. Math. Inst. Hung. Acad. Sci. 5 (1): 17–60.
Fan, Jianqing, Yuan Liao, and Han Liu. 2016. “An Overview of the Estimation of Large Covariance and Precision Matrices.” The Econometrics Journal 19 (1): C1–32. https://doi.org/10.1111/ectj.12061.
Fienberg, Stephen E., and Stanley S. Wasserman. 1981. “Categorical Data Analysis of Single Sociometric Relations.” Sociological Methodology 12: 156–92.
Foygel, Rina, and Mathias Drton. 2010. “Extended Bayesian Information Criteria for Gaussian Graphical Models.” arXiv Preprint arXiv:1011.6640.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2007. “Sparse Inverse Covariance Estimation with the Graphical Lasso.”
Ganguly, Apratim, and Wolfgang Polonik. 2014. “Local Neighborhood Fusion in Locally Constant Gaussian Graphical Models.” https://arxiv.org/abs/1410.8766.
Giraud, Christophe, Sylvie Huet, and Nicolas Verzelen. 2012. “Graph Selection with GGMselect.” Statistical Applications in Genetics and Molecular Biology 11 (3).
Grechkin, Maxim, Maryam Fazel, Daniela Witten, and Su-In Lee. 2015. “Pathway Graphical Lasso.” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2617–23. AAAI’15. Austin, Texas: AAAI Press.
Hadj-Selem, Fouad, Tommy Löfstedt, Elvis Dohmatob, Vincent Frouin, Mathieu Dubois, Vincent Guillemot, and Edouard Duchesnay. 2018. “Continuation of Nesterov’s Smoothing for Regression with Structured Sparsity in High-Dimensional Neuroimaging.” IEEE Transactions on Medical Imaging. https://doi.org/10.1109/TMI.2018.2829802.
Hocking, Toby D., Armand Joulin, Francis Bach, and Jean-Philippe Vert. 2011. “Clusterpath: An Algorithm for Clustering Using Convex Fusion Penalties.” In ICML.
Honorio, Jean, Dimitris Samaras, Nikos Paragios, Rita Goldstein, and Luis E. Ortiz. 2009. “Sparse and Locally Constant Gaussian Graphical Models.” Advances in Neural Information Processing Systems 22: 745–53.
Hsieh, Cho-Jui, Mátyás A. Sustik, Inderjit S. Dhillon, and Pradeep Ravikumar. 2014. “QUIC: Quadratic Approximation for Sparse Inverse Covariance Estimation.” Journal of Machine Learning Research 15 (83): 2911–47. http://jmlr.org/papers/v15/hsieh14a.html.
Kurtz, Zachary D., Christian L. Müller, Emily R. Miraldi, Dan R. Littman, Martin J. Blaser, and Richard A. Bonneau. 2015. “Sparse and Compositionally Robust Inference of Microbial Ecological Networks.” PLOS Computational Biology 11 (5): 1–25. https://doi.org/10.1371/journal.pcbi.1004226.
Lindsten, Fredrik, Henrik Ohlsson, and Lennart Ljung. 2011. “Clustering Using Sum-of-Norms Regularization: With Application to Particle Filter Output Computation.” In 2011 IEEE Statistical Signal Processing Workshop (SSP), 201–4. https://doi.org/10.1109/SSP.2011.5967659.
Liu, Han, Kathryn Roeder, and Larry Wasserman. 2010. “Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.” Advances in Neural Information Processing Systems 23.
Mazumder, Rahul, and Trevor Hastie. 2012. “The Graphical Lasso: New Insights and Alternatives.” Electronic Journal of Statistics 6: 2125–49. https://doi.org/10.1214/12-EJS740.
Meinshausen, Nicolai, and Peter Bühlmann. 2006. “High-Dimensional Graphs and Variable Selection with the Lasso.” Annals of Statistics 34 (3): 1436–62. https://doi.org/10.1214/009053606000000281.
———. 2010. “Stability Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (4): 417–73.
Newman, Mark E. J., Steven H. Strogatz, and Duncan J. Watts. 2001. “Random Graphs with Arbitrary Degree Distributions and Their Applications.” Physical Review E 64 (2): 026118.
Park, Mee Young, Trevor Hastie, and Robert Tibshirani. 2006. “Averaged Gene Expressions for Regression.” Biostatistics 8 (2): 212–27. https://doi.org/10.1093/biostatistics/kxl002.
Peng, Jie, Pei Wang, Nengfeng Zhou, and Ji Zhu. 2009. “Partial Correlation Estimation by Joint Sparse Regression Models.” Journal of the American Statistical Association 104 (486): 735–46. https://doi.org/10.1198/jasa.2009.0126.
Rocha, Guilherme V., Peng Zhao, and Bin Yu. 2008. “A Path Following Algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE).”
Rothman, Adam J., Peter J. Bickel, Elizaveta Levina, and Ji Zhu. 2008. “Sparse Permutation Invariant Covariance Estimation.” Electronic Journal of Statistics 2: 494–515. https://doi.org/10.1214/08-EJS176.
Tan, Kean Ming, Daniela Witten, and Ali Shojaie. 2013. “The Cluster Graphical Lasso for Improved Estimation of Gaussian Graphical Models.” July. http://arxiv.org/abs/1307.5339.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B 58 (1): 267–88.
Yang, Eunho, Genevera Allen, Zhandong Liu, and Pradeep Ravikumar. 2012. “Graphical Models via Generalized Linear Models.” Advances in Neural Information Processing Systems 25.
Yao, Tianyi, and Genevera I. Allen. 2019. “Clustered Gaussian Graphical Model via Symmetric Convex Clustering.” In 2019 IEEE Data Science Workshop (DSW), 76–82. https://doi.org/10.1109/DSW.2019.8755774.
Yuan, Ming. 2010. “High Dimensional Inverse Covariance Matrix Estimation via Linear Programming.” Journal of Machine Learning Research 11 (79): 2261–86. http://jmlr.org/papers/v11/yuan10b.html.
Yuan, Ming, and Yi Lin. 2007. “Model Selection and Estimation in the Gaussian Graphical Model.” Biometrika 94 (1): 19–35. https://doi.org/10.1093/biomet/asm018.
Zhao, Tuo, Han Liu, Kathryn Roeder, John Lafferty, and Larry Wasserman. 2012. “The huge Package for High-Dimensional Undirected Graph Estimation in R.” Journal of Machine Learning Research 13 (1): 1059–62.
@article{sanou,
author = {Edmond Sanou and Christophe Ambroise and Geneviève Robin},
title = {Inference of {Multiscale} {Gaussian} {Graphical} {Model}},
journal = {Computo},
date = {},
url = {https://github.com/desanou/multiscale_glasso},
doi = {xxxx},
langid = {en},
abstract = {Gaussian Graphical Models (GGMs) are widely used for
exploratory data analysis in various fields such as genomics,
ecology, psychometry. In a high-dimensional setting, when the number
of variables exceeds the number of observations by several orders of
magnitude, the estimation of GGM is a difficult and unstable
optimization problem. Clustering of variables or variable selection
is often performed prior to GGM estimation. We propose a new method
allowing to simultaneously infer a hierarchical clustering structure
and the graphs describing the structure of independence at each
level of the hierarchy. This method is based on solving a convex
optimization problem combining a graphical lasso penalty with a
fused type lasso penalty. Results on real and synthetic data are
presented.}
}
For attribution, please cite this work as:
Edmond Sanou, Christophe Ambroise, and Geneviève Robin. n.d.
“Inference of Multiscale Gaussian Graphical Model.” Computo. https://doi.org/xxxx.
Source Code
---title: "Inference of Multiscale Gaussian Graphical Model"subtitle: ""author: - name: Edmond Sanou url: https://desanou.github.io/ affiliation: "LaMME, Université d'Evry Val d'Essonne" affiliation-url: http://www.math-evry.cnrs.fr/ - name: Christophe Ambroise url: https://cambroise.github.io/ affiliation: "LaMME, Université d'Evry Val d'Essonne" affiliation-url: http://www.math-evry.cnrs.fr/ - name: Geneviève Robin url: https://genevieverobin.wordpress.com/ affiliation: "CNRS, LaMME, Université d'Evry Val d'Essonne" affiliation-url: http://www.math-evry.cnrs.fr/date: last-modifiedabstract: >+ Gaussian Graphical Models (GGMs) are widely used for exploratory data analysis in various fields such as genomics, ecology, psychometry. In a high-dimensional setting, when the number of variables exceeds the number of observations by several orders of magnitude, the estimation of GGM is a difficult and unstable optimization problem. Clustering of variables or variable selection is often performed prior to GGM estimation. We propose a new method allowing to simultaneously infer a hierarchical clustering structure and the graphs describing the structure of independence at each level of the hierarchy. This method is based on solving a convex optimization problem combining a graphical lasso penalty with a fused type lasso penalty. 
Results on real and synthetic data are presented.citation: type: article-journal container-title: "Computo" doi: "xxxx" url: https://github.com/desanou/multiscale_glassogithub: https://github.com/desanou/multiscale_glassobibliography: references.bibrepo: "multiscale_glasso"editor: markdown: wrap: 72---[](https://github.com/desanou/%7B%7B%3C%20meta%20repo%20%3E%7D%7D)[](http://creativecommons.org/licenses/by/4.0/)<!-- Pour les commentaires, voici les commandes à utiliser -->```{=html}<style>old {color: silver}revision { color: Orange }genevieve { color: Cyan }edmond { color: Red }</style>```<!-- Exemple <auteur> commentaire </auteur> --><!-- Pour barrer du texte: --><!-- ~~text~~ -->```{r setup, message=FALSE, echo = TRUE}knitr::opts_chunk$set(tidy =FALSE, out.width="49%", out.height="20%",fig.show='hold',fig.align='center')options(htmltools.dir.version =FALSE)library(ggplot2)library(simone)library(SpiecEasi)library(huge)library(Matrix)library(ghibli)```# IntroductionProbabilistic graphical models [@Lauritzen1996; @Koller2009] are widelyused in high-dimensional data analysis to synthesize the interactionbetween variables. In many applications, such as genomics or imageanalysis, graphical models reduce the number of parameters by selectingthe most relevant interactions between variables. Undirected GaussianGraphical Models (GGMs) are a class of graphical models used in Gaussiansettings. In the context of high-dimensional statistics, graphicalmodels are generally assumed sparse, meaning that a small number ofvariables interact, compared to the total number of possibleinteractions. This assumption has been shown to provide both statisticaland computational advantages by simplifying the structure of dependencebetween variables [@Dempster1972] and allowing efficient algorithms[@Meinshausen2006]. 
See, for instance, [@Fan2016] for a review aboutsparse graphical models inference.In GGMs, it is well known (see, e.g., @Lauritzen1996) that inferring thegraphical model or equivalently the Conditional Independence Graph (CIG)boils down to inferring the support of the precision matrix$\mathbf{\Omega}$ (the inverse of the variance-covariance matrix). To doso, several $\ell_1$ penalized methods have been proposed in theliterature to learn the CIG of GGMs. For instance, the neighborhoodselection [@Meinshausen2006] (MB) based on a nodewise regressionapproach via the least absolute shrinkage and selection operator (Lasso,@tibshirani1996) is a popular method. Each variable is regressed on theothers, taking advantage of the link between the so-obtained regressioncoefficients and partial correlations. More precisely, for all$1 \le i \le p$, the following problem is solved:$$\hat{\boldsymbol{\beta}^i}(\lambda) = \underset{\boldsymbol{\beta}^i \in \mathbb{R}^{p-1}}{\operatorname{argmin}} \frac{1}{n} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda \left \lVert \boldsymbol{\beta}^i \right \rVert_1.$$ {#eq-neighborhood}In @eq-neighborhood, $\lambda$ is a non negative regularizationparameter and $\mathbf{X}^{\setminus i}$ denotes the matrix $\mathbf{X}$deprived of column $i.$ The MB method defined by the estimation problem<edmond> in </edmond> @eq-neighborhood has generated a long line of workin the field of nodewise regression methods. 
For instance, @Rocha2008,@Ambroise2009 showed that nodewise regression could be seen as apseudo-likelihood approximation, and @Peng2009 extended the MB method toestimate sparse partial correlations using a single regression problem.Other inference methods similar to nodewise regression include a methodbased on Dantzig selector [@Yuan2010] and the introduction of the Climeestimator [@Cai2011].Another family of sparse CIG inference methods directly estimates$\mathbf{\Omega}$ via direct minimization of the $\ell_1$-penalizednegative log-likelihood [@Banerjee2008], without resorting to theauxiliary regression problem. This method, called the graphical lasso[@Friedman2007], benefits from many optimization algorithms [@Yuan2007;@Rothman2008; @Banerjee2008; @Hsieh2014].Such inference methods are widely used and enjoy many favorabletheoretical and empirical properties, including robustness tohigh-dimensional problems. However, some limitations have been observed,particularly in the presence of strongly correlated variables. Theselimitations are caused by known impairments of Lasso-type regularizationin this context [@Buhlmann2012; @Park2007]. To overcome this, inaddition to sparsity, several previous works attempt to estimate CIG byintegrating clustering structures among variables for the sake of bothstatistical sanity and interpretability. A non-exhaustive list of worksthat integrate a clustering structure to speed up or improve theestimation procedure includes [@Honorio2009; @Ambroise2009;@Mazumder2012; @Tan2013; @Yao2019; @Devijver2018].The above methods exploit the group structure to simplify the graphinference problem and infer the CIG between single variables. Anotherquestion that has received less attention is the inference of the CIGbetween the groups of variables, i.e., between the meta-variablesrepresentative of the group structure. A recent work introducinginference of graphical models on multiple grouping levels is[@Cheng2017]. 
They proposed inferring the CIG of gene data on two levelscorresponding to genes and pathways, respectively. Note that pathwaysare groups of functionally related genes known in advance. The inferenceis achieved by optimizing a penalized maximum likelihood that estimatesa sparse network at both gene and group levels. Our work is also part ofthis dynamic. We introduce a penalty term allowing parsimonious networksto be built at different hierarchical clustering levels. The maindifference with the procedure of [@Cheng2017] is that we do not requireprior knowledge of the group structure, which makes the problemsignificantly more complex. In addition, our method has the advantage ofproposing CIGs at more than two levels of granularity.We introduce the Multiscale Graphical Lasso (MGLasso), a novel method toestimate simultaneously a hierarchical clustering structure, andgraphical models depicting the conditional independence structurebetween clusters of variables at each level of the hierarchy. Theprocedure is based on a convex optimization problem with a hybridpenalty term combining a graphical Lasso and a group-fused Lassopenalty. In the spirit of convex hierarchical clustering, introduced by[@Hocking2011; @Lindsten2011], the hierarchy is obtained by spanning theentire regularization path. At each level of the hierarchy, variables inthe same clusters are represented by a meta-variable; the new CIG isestimated between these new meta-variables, leading to a multiscalegraphical model. Unlike [@Yao2019], who introduced convex clustering inGGMs, our approach is expected to produce sparse estimations, thanks toan additional $\ell_1$ penalty.The remainder of this paper is organized as follows. In section[Multiscale Graphical Lasso], we formally introduce the MultiscaleGraphical Lasso based on a convex estimation problem and an optimizationalgorithm based on the continuation of Nesterov's smoothing technique[@hadjselem2018]. 
Section [Simulation experiments] presents numericalresults on simulated and real data.# Multiscale Graphical LassoThe proposed method aims at inferring a graphical Gaussian model whilehierarchically grouping variables. It infers conditional independencebetween different groups of variables. The approach is based onneighborhood selection [@Meinshausen2006] and considers an additionalfused-Lasso type penalty for clustering. In the spirit of hierarchicalconvex clustering, the hierarchical structure is recovered by spanningthe regularization path.Let $X = (X^1, \dots, X^p)^T \in \mathbb R^p$ be a $p$-dimensionalGaussian random vector, with mean vector $\mu$ and covariance matrix$\mathbf \Sigma$. The conditional independence structure of $X$ ischaracterized by a graph $G = (V, E)$, where $V = \{1,\ldots p\}$ is theset of variables and $E$ the set of edges, uniquely determined by thesupport of the precision matrix $\mathbf{\Omega} = \mathbf{S}^{-1}$(see, e.g., @Dempster1972). In other words, for any two vertices$i,j\in V$, the edge $(i,j)$ belongs to the set $E$ if and only if$\Omega_{ij} \neq 0$, that is if and only if the $i$-th and $j$-thvariables are conditionally independent given all the others i.e.$X^i \perp \!\!\! \perp X^j |X^{\setminus (i, j)}$ where$X^{\setminus (i, j)}$ is the set of all $p$ variables deprived ofvariables $i$ and $j$.Considering the linear model$X^i = \sum_{j\neq i} \beta^i_j X_j + \epsilon_i$ where $\epsilon_i$ isa Gaussian centered random variable, we have$\beta^i_j = -\frac{\Omega_{ij}}{\Omega_{ii}}$. We define the regressionmatrix$\boldsymbol{\beta} := [{\boldsymbol{\beta}^1}^T, \ldots, {\boldsymbol{\beta}^p}^T]^T \in \mathbb{R}^{p \times (p-1)}$,whose rows are the regression vectors for each of the $p$ regressions.Let the $n \times p$-dimensional matrix $\mathbf{X}$ contain $n$independent observations of $X$. 
We propose to minimize the following criterion, which combines Lasso and group-fused Lasso penalties:
$$J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X} ) = \frac{1}{2} \sum_{i=1}^p \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2,$$ {#eq-cost-fct}

where $\tau_{ij}$ is the permutation that exchanges the coefficients $\boldsymbol{\beta}^j_i$ and $\boldsymbol{\beta}^j_j$ and leaves the other coefficients untouched, $\mathbf{X}^{i}\in \mathbb{R}^n$ denotes the $i$-th column of $\mathbf{X}$, $\boldsymbol{\beta}^{i}$ denotes the $i$-th row of $\boldsymbol{\beta}$, and $\lambda_1$ and $\lambda_2$ are penalization parameters. Let us consider
$$\hat{\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}}{\operatorname{argmin}} J_{\lambda_1, \lambda_2}(\boldsymbol{\beta}; \mathbf{X}).$$
The lasso penalty term encourages sparsity, and the term $\left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2$ encourages the fusion of the regression vectors $\boldsymbol{\beta}^i$ and $\boldsymbol{\beta}^j$. These fusions uncover a clustering structure: the model tends to cluster together variables that have the same conditional effects on the others. Variables $X^i$ and $X^j$ are assigned to the same cluster when $\boldsymbol{\beta}^i = \tau_{ij}(\boldsymbol{\beta}^j).$

Let us illustrate the effect of the proposed approach with an example. Two variables $i$ and $j$ are in the same group when $\|\boldsymbol{\beta}^i - \tau_{ij} (\boldsymbol{\beta}^j)\|_2=0$. Considering a cluster $\mathcal C$ of $q$ variables, it is straightforward to show that for all $(i,j) \in \mathcal C^2$ we have $\beta_{ij}=\beta_{\mathcal C}$: the distance constraints force the equality of all regression coefficients within the group. Thus, for a given value of $\lambda_2$, the algorithm tends to produce precision matrices with blocks of constant entries, each block corresponding to a cluster.

The greater the regularization weight $\lambda_2$, the larger the groups become. This is the core principle of the convex relaxation of hierarchical clustering introduced by @Hocking2011. Hence, we can derive a hierarchical clustering structure by spanning the regularization path obtained by varying $\lambda_2$ while $\lambda_1$ is fixed.
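To fix ideas, here is an illustrative numpy sketch of the criterion and of the fusion rule (not the solver): $\boldsymbol{\beta}$ is stored as a $p \times p$ matrix with zero diagonal rather than the $p \times (p-1)$ layout of the paper, `tau_swap` plays the role of $\tau_{ij}$, and all numeric values are made up.

```python
import numpy as np

def tau_swap(row, i, j):
    """tau_ij: exchange the coefficients at positions i and j of a length-p row."""
    out = row.copy()
    out[i], out[j] = row[j], row[i]
    return out

def mglasso_objective(beta, X, lam1, lam2):
    """MGLasso criterion J with beta as a p x p matrix with zero diagonal
    (a convenience layout used only in this sketch)."""
    p = X.shape[1]
    fit = 0.5 * sum(np.sum((X[:, i] - X @ beta[i]) ** 2) for i in range(p))
    l1 = np.abs(beta).sum()
    fused = sum(np.linalg.norm(beta[i] - tau_swap(beta[j], i, j))
                for i in range(p) for j in range(i + 1, p))
    return fit + lam1 * l1 + lam2 * fused

# Variables i and j are fused when beta[i] equals tau_ij(beta[j]):
beta = np.array([[0.0, 0.3, 0.5],
                 [0.3, 0.0, 0.5],   # row 1 = tau_01(row 0)
                 [0.2, 0.2, 0.0]])
d01 = np.linalg.norm(beta[0] - tau_swap(beta[1], 0, 1))
```

Here row 1 equals $\tau_{01}$ applied to row 0, so the fused distance `d01` is zero and variables $0$ and $1$ fall in the same cluster.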
The addition of a fused-type term in graphical model inference has been studied previously by authors such as @Honorio2009, @ganguly2014 and @Grechkin2015. However, these existing methods require prior knowledge of the neighborhood of each variable. On the contrary, our approach allows simultaneous inference of a multi-level graphical model and a hierarchical clustering of the variables.

In practice, if some information about the clustering structure is available, the problem can be generalized into:
$$\min_{\boldsymbol{\beta}} \sum_{i=1}^p\frac{1}{2} \left \lVert \mathbf{X}^i - \mathbf{X}^{\setminus i} \boldsymbol{\beta}^i \right \rVert_2 ^2 + \lambda_1 \sum_{i = 1}^p \left \lVert \boldsymbol{\beta}^i \right \rVert_1 + \lambda_2 \sum_{i < j} w_{ij} \left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2,$$ {#eq-cost-fct-general}

where $w_{ij}$ is a positive weight encoding prior knowledge of the groups to which variables $i$ and $j$ belong. A good choice of these pairwise affinities can reduce computation time and improve clustering performance. The choice can be guided by the application; however, improper specification can give rise to clustering structures that are difficult to interpret [@chakraborty2020biconvex]. In the remainder of the paper, we assume that $w_{ij} = 1$ for simplicity.

# Numerical scheme

This section introduces a complete numerical scheme to apply MGLasso in practice, using a convex optimization algorithm and a model selection procedure. Section [Optimization via CONESTA algorithm] reviews the principles of the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, @hadjselem2018), the optimization algorithm used in practice to solve MGLasso. Section [Reformulation of MGLasso for CONESTA algorithm] details a reformulation of MGLasso, which enables us to apply CONESTA.
Finally, Section [Model selection] presents the procedure used to select the regularization parameters.

## Optimization via CONESTA algorithm

The optimization problem for Multiscale Graphical Lasso is convex but not straightforward to solve using classical algorithms because of the fused-lasso type penalty, which is non-separable and admits no closed-form proximal operator. We rely on the Continuation with Nesterov smoothing in a shrinkage-thresholding algorithm (CONESTA, @hadjselem2018), dedicated to high-dimensional regression problems with structured sparsity such as group structures. The CONESTA solver, initially introduced for neuro-imaging problems, addresses a general class of convex optimization problems which includes group-wise penalties. The algorithm solves problems of the form
$$\operatorname{minimize} \quad f(\boldsymbol{\theta}) = g(\boldsymbol{\theta}) + \lambda_1 h(\boldsymbol{\theta}) + \lambda_2 s(\boldsymbol{\theta}),$$ {#eq-conesta-criterion}
with parameter vector $\boldsymbol{\theta}\in \mathbb{R}^d$, where $\lambda_1$ and $\lambda_2$ are penalty parameters.

In the original paper [@hadjselem2018], $g(\boldsymbol{\theta})$ is a differentiable function and $h(\boldsymbol{\theta})$ is a penalty function whose proximal operator $\operatorname{prox}_{\lambda_1 h}$ is known in closed form. Given $\phi \subseteq \{1,\ldots, d\}$, let $\boldsymbol{\theta}_\phi = (\theta_i)_{i \in \phi}$ denote the subvector of $\boldsymbol{\theta}$ referenced by the indices in $\phi$. Denote by $\Phi = \{ \phi_1, \dots, \phi_{\operatorname{Card}(\Phi)}\}$ a collection of such subsets, with $\phi_i \subseteq \{1,\ldots, d\}$. Let the matrix $\mathbf{A}_\phi \in \mathbb{R}^{m \times \operatorname{Card}(\phi)}$ define a linear map from $\mathbb{R}^{\operatorname{Card}(\phi)}$ to $\mathbb{R}^m$, sending the column vector $\boldsymbol{\theta}_\phi \in \mathbb{R}^{\operatorname{Card}(\phi)}$ to the column vector $\mathbf{A}_\phi \boldsymbol{\theta}_\phi \in \mathbb{R}^m$. The function $s(\boldsymbol{\theta})$ is assumed to be an $\ell_{1,2}$-norm, i.e. the sum of the group-wise $\ell_2$-norms of the elements $\mathbf{A}_\phi \boldsymbol{\theta}_\phi$, $\phi \in \Phi$. Namely,
$$s(\boldsymbol{\theta}) = \sum_{\phi \in \Phi} \|\mathbf{A}_\phi \boldsymbol{\theta}_\phi\|_2.$$
When $\mathbf{A}_\phi$ is the identity operator, so that $\mathbf{A}_\phi \boldsymbol{\theta}_\phi = \boldsymbol{\theta}_\phi$, the penalty function $s$ is the overlapping group lasso and $m = \operatorname{Card}(\phi)$. When $\mathbf{A}_\phi$ is a discrete derivative operator, $s$ is a total variation penalty and $m$ can be seen as the number of neighborhood relationships. The non-smooth $\ell_{1,2}$-norm penalty, whose proximal operator is unknown, can be approximated by a smooth function with known gradient computed using Nesterov's smoothing [@nesterov2005smooth].
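Before smoothing, it may help to see the penalty itself: with identity operators $\mathbf{A}_\phi$, $s(\boldsymbol{\theta})$ is just a sum of group-wise norms. A minimal numpy sketch, with made-up groups and values:

```python
import numpy as np

def group_l12(theta, groups):
    """l_{1,2} penalty s(theta) = sum over groups phi of ||theta_phi||_2,
    i.e. the case where each A_phi is the identity (group-lasso penalty)."""
    return sum(np.linalg.norm(theta[list(phi)]) for phi in groups)

theta = np.array([3.0, 4.0, 0.0, 2.0])
groups = [[0, 1], [2, 3]]      # a partition here; groups may overlap in general
s = group_l12(theta, groups)   # ||(3, 4)||_2 + ||(0, 2)||_2 = 5 + 2 = 7
```

The MGLasso fused penalty fits this template with $\mathbf{A}_\phi \tilde{\boldsymbol{\beta}} = \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j)$, one group per pair $(i,j)$.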
Given a smoothness parameter $\mu>0$, let us define the smooth approximation
$$ s_{\mu}(\boldsymbol{\theta}) = \max_{\boldsymbol{\alpha} \in \mathcal{K}} \left \{ \boldsymbol{\alpha}^T \mathbf{A} \boldsymbol{\theta} - \frac{\mu}{2} \| \boldsymbol{\alpha} \|_2^2 \right \},$$
where $\mathcal{K}$ is the Cartesian product of $\ell_2$-unit balls, $\mathbf{A}$ is the vertical concatenation of the matrices $\mathbf{A}_\phi$, and $\boldsymbol{\alpha}$ is an auxiliary variable resulting from the dual reformulation of $s(\boldsymbol{\theta})$. Note that $\lim_{\mu \rightarrow 0} s_{\mu}(\boldsymbol{\theta}) = s(\boldsymbol{\theta}).$ A Fast Iterative Shrinkage-Thresholding Algorithm (FISTA, @Beck2009) step can then be applied after computing the gradient of the smooth part of the approximated criterion, i.e. $g(\boldsymbol{\theta}) + \lambda_2 s_{\mu}(\boldsymbol{\theta})$.

The main ingredient of CONESTA is the determination of the optimal smoothness parameter using the duality gap, which minimizes the number of FISTA iterations for a given precision $\epsilon$. The smoothness parameter $\mu$ is updated dynamically: a sequence of decreasing optimal smoothness parameters is generated in order to adapt the FISTA step size towards $\epsilon$, namely $\mu^k = \mu_{opt}(\epsilon^k)$. The smoothness parameter decreases as one gets closer to $\boldsymbol{\theta}^\star$, the solution of the problem defined in @eq-conesta-criterion. Since $\boldsymbol{\theta}^\star$ is unknown, the distance to the minimum is approximated via the duality gap.
Indeed,
$$\operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k) \ge f_{\mu^k}(\boldsymbol{\theta}^k) - f(\boldsymbol{\theta}^\star) \ge 0.$$
We refer the reader to the seminal paper for more details on the formulation of $\operatorname{GAP}_{\mu^k}(\boldsymbol{\theta}^k)$. The CONESTA routine is spelled out in the CONESTA solver algorithm, where $L(g + \lambda_2 s_{\mu})$ is the Lipschitz constant of $\nabla(g + \lambda_2 s_{\mu})$. The CONESTA algorithm enjoys a linear convergence rate and was shown empirically to outperform other computational options for structured-sparsity problems, such as ADMM and inexact FISTA, in terms of convergence speed [@hadjselem2018].

::: {#conesta}
```{=html}
<pre id="conesta">
\begin{algorithm}
\caption{CONESTA solver}
\begin{algorithmic}
  \STATE \textbf{Inputs}: \\
    $\quad$ functions $g(\boldsymbol{\theta}), h(\boldsymbol{\theta}), s(\boldsymbol{\theta})$ \\
    $\quad$ precision $\epsilon$ \\
    $\quad$ penalty parameters $\lambda_1, \lambda_2$ \\
    $\quad$ decreasing factor $\tau \in (0,1)$ for the sequence of precisions
  \STATE \textbf{Output:} \\
    $\quad$ $\boldsymbol{\theta}^{i+1} \in \mathbb{R}^d$
  \STATE \textbf{Initializations:} \\
    $\quad \boldsymbol{\theta}^0 \in \mathbb{R}^d$ \\
    $\quad \epsilon^0 = \tau \operatorname{GAP}_{\mu = 10^{-8}}(\boldsymbol{\theta}^0)$ \\
    $\quad \mu^0 = \mu_{opt}(\epsilon^0)$
  \Repeat
    \STATE $\epsilon^i_{\mu} = \epsilon^i - \mu^i \lambda_2 \frac{d}{2}$ \\
    \COMMENT{FISTA}
    \STATE $k=2$ \COMMENT{new iterator}
    \STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^1 = \boldsymbol{\theta}_{\operatorname{FISTA}}^0 = \boldsymbol{\theta}^i$ \COMMENT{Initial parameters value}
    \STATE $t_{\mu} = \frac{1}{L(g + \lambda_2 s_{\mu})}$ \COMMENT{Compute stepsize}
    \Repeat
      \STATE $\boldsymbol{z} = \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} + \frac{k-2}{k+1}(\boldsymbol{\theta}_{\operatorname{FISTA}}^{k-1} - \boldsymbol{\theta}_{\operatorname{FISTA}}^{k-2})$
      \STATE $\boldsymbol{\theta}_{\operatorname{FISTA}}^k = \operatorname{prox}_{\lambda_1 h}(\boldsymbol{z} - t_{\mu} \nabla(g + \lambda_2 s_{\mu})(\boldsymbol{z}))$
    \Until{$\operatorname{GAP}_{\mu}(\boldsymbol{\theta}_{\operatorname{FISTA}}^k) \le \epsilon_{\mu}^i$}
    \STATE $\boldsymbol{\theta}^{i+1} = \boldsymbol{\theta}_{\operatorname{FISTA}}^k$ \\
    \STATE $\epsilon^i = \operatorname{GAP}_{\mu = \mu^i}(\boldsymbol{\theta}^{i+1}) + \mu^i \lambda_2 \frac{d}{2}$ \\
    \STATE $\epsilon^{i+1} = \tau \epsilon^{i}$ \\
    \STATE $\mu^{i+1} = \mu_{opt}(\epsilon^{i+1})$
  \Until{$\epsilon^i \le \epsilon$}
\end{algorithmic}
\end{algorithm}
</pre>
```
:::

## Reformulation of MGLasso for CONESTA algorithm

To apply CONESTA, it is necessary to reformulate the MGLasso problem in order to comply with the form of objective function required by CONESTA. The objective of MGLasso can indeed be written as
$$\operatorname{argmin} \frac{1}{2} ||\mathbf{Y} - \tilde{\mathbf{X}} \tilde{\boldsymbol{\beta}}||_2^2 + \lambda_1 ||\tilde{\boldsymbol{\beta}}||_1 + \lambda_2 \sum_{i<j} ||\boldsymbol D_{ij} \tilde{\boldsymbol{\beta}}||_2,$$ {#eq-refpbm}
where $\mathbf{Y} = \operatorname{Vec}(\mathbf{X}) \in \mathbb{R}^{np}$, $\tilde{\boldsymbol{\beta}} = \operatorname{Vec}(\boldsymbol{\beta}) \in \mathbb{R}^{p (p-1)}$, and $\tilde{\mathbf{X}}$ is an $np \times p(p-1)$ block-diagonal matrix with $\mathbf{X}^{\setminus i}$ on the $i$-th block. The matrix
The matrix$\boldsymbol A_{ij}$ is a $(p-1)\times p(p-1)$ matrix <old> defined by</old><edmond> chosen so that$\boldsymbol D_{ij} \tilde{\boldsymbol{\beta}} = \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j).$</edmond><old>$\boldsymbol D_{ij} (k, l) = \begin{cases} 1, \ \text{if} \ l =(i-1)p+k, \\ -1, \ \text{if} \ l = (j-1)p+k,\\ 0, \ \text{otherwise.} \end{cases}$</old>Note that we introduce this notation for simplicity of exposition, but,in practice, the sparsity of the matrices $\boldsymbol D_{ij}$ allows amore efficient implementation. Based on reformulation @eq-refpbm, we mayapply CONESTA to solve the objective of MGLasso for fixed $\lambda_1$and $\lambda_2$. The procedure is applied, for fixed $\lambda_1$, to arange of decreasing values of $\lambda_2$ to obtain a hierarchicalclustering. The corresponding pseudo-code is given in the followingalgorithm where $(\mathbf{X}^i)^{\dagger}$ denotes the pseudo-inverse of$\mathbf{X}^i$ and $\epsilon_{fuse}$ the threshold for merging clusters.::: {#algo-mglasso}```{=html}<pre id="algo-mglasso">\begin{algorithm}\caption{MGLasso algorithm}\begin{algorithmic} \STATE \textbf{Inputs}: \\ $\quad$ Set of variables $\mathbf{X} = \{\mathbf{X}^1, \dots, \mathbf{X}^p \} \in \mathbb R^{n\times p}$ \\ $\quad$ Penalty parameters $\lambda_1 \ge 0, {\lambda_2}_{\operatorname{initial}} > 0$ \\ $\quad$ Increasing factor $\eta > 1$ for fusion penalties $\lambda_2$\\ $\quad$ Fusion threshold $\epsilon_{fuse} \ge 0$ \STATE \textbf{Outputs:} For $\lambda_1$ fixed and $\lambda_2$ from $0$ to ${\lambda_2}_{\operatorname{initial}} \times \eta^{(I)}$ with $I$ the number of iterations: \\ $\quad$ Regression vectors $\boldsymbol{\beta}(\lambda_1, \lambda_2) \in \mathbb R^{p \times (p-1)}$, \\ $\quad$ Clusters of variables indices $C(\lambda_1, \lambda_2) =\{ C_k \}_{(\lambda_1, \lambda_2)}, k=1, \dots, K$ \STATE \textbf{Initializations:} \\ $\quad$ $\boldsymbol{\beta}^i = (\mathbf{X}^i)^{\dagger}\mathbf{X}^i$, $\forall i = 1, \dots, p$ for 
warm start in CONESTA solver \\ $\quad$ $C = \left \{\{1\}, \dots, \{p\}\right \}$ Initial clusters with one element per cluster. \\ $\quad$ Set $\lambda_2 = 0$ \\ $\quad$ Compute $\boldsymbol{\beta}$ using CONESTA solver (Equivalent to Meinshausen-Bühlmann approach) \\ $\quad$ Update clusters $C$ with rule described in \textbf{while} loop. \STATE \textbf{Set:} $\lambda_2 = {\lambda_2}_{\operatorname{initial}}$ \\ \COMMENT{Clustering path} \WHILE{$\operatorname{Card}(C) > 1$} \STATE Compute $\boldsymbol{\beta}$ using CONESTA solver with warm start from previous iteration \\ \COMMENT{Clusters update} \STATE Compute pairwises distances $d(i,j)=\left \lVert \boldsymbol{\beta}^i - \tau_{ij}(\boldsymbol{\beta}^j) \right \rVert_2$, $\forall i,j \in \{1, \dots, p\}$ \\ \STATE Determine clusters $C_k (k=1, \dots, K)$ with the rule $(i,j) \in C_k$ iff. $d(i,j) \le \epsilon_{fuse}$ \STATE $\lambda_2 = \lambda_2 \times \kappa$ \ENDWHILE\end{algorithmic}\end{algorithm}</pre>```:::## Model selectionA crucial question for practical applications is the definition of arule to select the penalty parameters ($\lambda_1, \lambda_2$). Thisselection problem operates at two levels: $\lambda_1$ controls thesparsity of the graphical model, and $\lambda_2$ controls the number ofclusters in the optimal clustering partition. These two parameters aredealt with separately: the sparsity parameter $\lambda_1$ is chosen viamodel selection, while the clustering parameter $\lambda_2$ variesacross a grid of values, in order to obtain graphs with different levelsof granularity. The problem of model selection in graphical models isdifficult in the high dimensional case where the number of samples issmall compared to the number of variables, as classical AIC and BICcriteria tend to perform poorly [@Liu2010]. 
Alternative criteria have been proposed in the literature, such as cross-validation [@bien2011sparse], GGMSelect [@giraud2012graph], stability selection [@meinshausen2010stability; @Liu2010], the Extended Bayesian Information Criterion (EBIC) [@foygel2010extended], and the Rotation Information Criterion [@zhao2012huge].

In this paper, we focus on the StARS stability selection approach proposed by @Liu2010. The method uses $N$ subsamples of the data to estimate the associated graphs for a given range of $\lambda_1$ values. For each value, a global instability of the graph edges is computed, and the optimal value of $\lambda_1$ is chosen so as to minimize this instability, as follows. Let $\lambda^{(1)}_1, \dots, \lambda_1^{(K)}$ be a grid of sparsity regularization parameters, and $S_1, \dots, S_N$ be $N$ bootstrap samples obtained by sampling the rows of the data set $\mathbf{X}$. For each $k\in\{1,\ldots,K\}$ and each $j\in\{1,\ldots, N\}$, we denote by $\mathcal{A}^{k,j}(\mathbf{X})$ the adjacency matrix of the graph estimated by applying the inference algorithm to $S_j$ with regularization parameter $\lambda_1^{(k)}$. For each possible edge $(s,t)\in\{1,\ldots,p\}^2$, the probability of edge appearance is estimated empirically by
$$\hat \theta_{st}^{(k)} = \frac{1}{N} \sum_{j=1}^N \mathcal{A}^{k,j}_{st}.$$
Define
$$\hat \xi_{st}(\lambda_1^{(k)}) = 2 \hat \theta_{st}^{(k)} \left ( 1 - \hat \theta_{st}^{(k)} \right )$$
the empirical instability of edge $(s,t)$ (that is, twice the variance of the Bernoulli indicator of edge $(s,t)$).
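A minimal numpy sketch of these instability quantities, on made-up bootstrap graphs (averaging $\hat\xi_{st}$ over all edges gives the global instability that StARS thresholds):

```python
import numpy as np

# Toy StARS computation for one value of lambda_1:
# N = 4 bootstrap adjacency matrices on p = 3 nodes (made-up graphs).
# Edge (0, 1) appears in 2 of the 4 estimated graphs; edge (1, 2) in all 4.
adj = np.zeros((4, 3, 3))
adj[:2, 0, 1] = adj[:2, 1, 0] = 1
adj[:, 1, 2] = adj[:, 2, 1] = 1

theta_hat = adj.mean(axis=0)            # empirical edge-appearance probabilities
xi = 2 * theta_hat * (1 - theta_hat)    # per-edge instability
iu = np.triu_indices(3, k=1)
d_hat = xi[iu].mean()                   # global instability for this lambda_1
```

An edge present in half of the subsampled graphs contributes the maximal instability $0.5$, while an edge present in all of them (or none) contributes $0$.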
The instability level associated with $\lambda_1^{(k)}$ is given by
$$\hat D(\lambda_1^{(k)}) = \frac{\sum_{s<t} \hat \xi_{st}(\lambda_1^{(k)})}{\binom{p}{2}}.$$
StARS selects the optimal penalty parameter as follows:
$$\hat \lambda = \max_k\left\{ \lambda_1^{(k)}: \hat D(\lambda_1^{(k)}) \le \beta, k\in\{1,\ldots,K\} \right \},$$
where $\beta$ is the threshold chosen for the instability level.

# Simulation experiments

In this section, we conduct a simulation study to evaluate the performance of the MGLasso method, both in terms of clustering and support recovery. Receiver Operating Characteristic (ROC) curves are used to evaluate the adequacy of the inferred graphs with the ground truth for the MGLasso and GLasso methods in the Erdös-Renyi, Scale-free, and Stochastic Block Model frameworks. Adjusted Rand indices are used to compare the partitions obtained with MGLasso, hierarchical agglomerative clustering, and K-means clustering in a stochastic block model framework.

## Synthetic data models

We consider three different synthetic network models: the Stochastic Block Model (SBM, @fienberg1981categorical), the Erdös-Renyi model [@erdHos1960evolution] and the Scale-free model [@newman2001random]. In each case, Gaussian data is generated by drawing $n$ independent realizations of a multivariate Gaussian distribution $\mathcal N(0, \mathbf{\Sigma})$, where $\mathbf{\Sigma} \in \mathbb{R}^{p \times p}$ and $\mathbf{\Omega} = \mathbf{\Sigma} ^{-1}$. The support of $\mathbf{\Omega}$, equivalent to the network adjacency matrix, is generated from the three different models. The difficulty of the problem is controlled by varying the ratio $\frac{n}{p}$ with $p$ fixed at $40$: $\frac{n}{p}\in \{0.5,1,2\}$.

### Stochastic Block-Model

We construct a block-diagonal precision matrix $\mathbf{\Omega}$ as follows. First, we generate the support of $\mathbf{\Omega}$ as shown in @fig-model-sbm, denoted by $\boldsymbol A\in\{0,1\}^{p\times p}$.
To do this, the variables are first partitioned into $K = 5$ hidden groups, denoted $C_1, \dots, C_K$ and described by latent random variables $Z_i$, such that $Z_i = k$ if $i \in C_k$. $Z_i$ follows a multinomial distribution
$$ P(Z_i = k) = \pi_k, \quad \forall k \in \{1, \dots, K\},$$
where $\pi = (\pi_1, \dots, \pi_K)$ is the vector of cluster proportions, which sums to one. The set of latent variables is denoted $\mathbf{Z} = \{ Z_1, \dots, Z_p\}$. Conditionally on $\mathbf{Z}$, $A_{ij}$ follows a Bernoulli distribution such that
$$A_{ij}|Z_i = k, Z_j = l \sim \mathcal{B}(\alpha_{kl}), \quad \forall k,l \in \{1, \dots, K\},$$
where $\alpha_{kl}$ is the connectivity probability between clusters $k$ and $l$, with $\alpha_{kl} = 0.01$ if $k\neq l$ and $\alpha_{ll} = 0.75$. For $k\in\{1,\ldots, K\}$, we define $p_k = \sum_{i=1}^p \boldsymbol{1}_{\{Z_i = k\}}$. The precision matrix $\mathbf{\Omega}$ of the graph is then calculated as follows. We define $\Omega_{ij} = 0$ if $Z_i\neq Z_j$; otherwise, we define $\Omega_{ij} = A_{ij}\omega_{ij}$ where, for all $i\in\{1,\ldots,p\}$ and for all $j\in\{1,\ldots,p \,|\, Z_j = Z_i = k\}$, $\omega_{ij}$ is given by
$$\omega_{ij} = \begin{cases} \dfrac{1+\rho(p_k-2)}{1+\rho(p_k-2)-\rho^2(p_k-1)} & \text{if } i = j,\\[6pt] \dfrac{-\rho}{1+\rho(p_k-2)-\rho^2(p_k-1)} & \text{if } i \neq j. \end{cases}$$
If $\alpha_{ll}$ were equal to one, this construction of $\mathbf{\Omega}$ would make it possible to control the level of correlation between the variables within each block to $\rho$.
Introducing a more realistic scheme with $\alpha_{ll}=0.75$ allows only approximate control of the correlation level.

::: {#fig-model-sbm}
```{r echo = TRUE}
bloc_diag <- function(n_vars, connectivity_mat, prop_clusters, rho) {
  true_nclusters <- 0
  while (true_nclusters != length(prop_clusters)) {
    ## To make sure we have the required number of clusters
    network <- simone::rNetwork(n_vars, connectivity_mat, prop_clusters)
    true_nclusters <- length(unique(network$clusters))
  }
  precision_mat <- network$A[order(network$clusters), order(network$clusters)]
  eff_clusters <- table(network$clusters)
  b <- -rho / (1 + rho * (eff_clusters - 2) - rho^2 * (eff_clusters - 1))
  d <- (1 + rho * (eff_clusters - 2)) /
    (1 + rho * (eff_clusters - 2) - rho^2 * (eff_clusters - 1))
  for (i in 1:length(eff_clusters)) {
    temp <- precision_mat[which(row.names(precision_mat) == i),
                          which(row.names(precision_mat) == i)]
    temp <- as.matrix(temp)
    temp[temp != 0] <- b[i]
    diag(temp) <- d[i]
    precision_mat[which(row.names(precision_mat) == i),
                  which(row.names(precision_mat) == i)] <- temp
  }
  flag <- min(svd(precision_mat)$d)
  if (flag < 0) {
    diag(precision_mat) <- 1 - flag
  }
  return(precision_mat)
}

#' simulate data with given graph structure
sim_data <- function(p = 20,
                     np_ratio = 2,
                     structure = c("block_diagonal", "hub", "scale_free", "erdos"),
                     alpha, prob_mat, rho, g,
                     inter_cluster_edge_prob = 0.01,
                     p_erdos = 0.1,
                     verbose = FALSE) {
  structure <- match.arg(structure)
  n <- round(np_ratio * p)
  switch(structure,
    erdos = {
      network <- simone::rNetwork(p, p_erdos, 1)
      correlation <- solve(network$Theta)
      X <- mvtnorm::rmvnorm(n, sigma = correlation)
      graph <- network$Theta
      if (verbose) message("Data, precision matrix and graph generated from Erdos structure.")
    },
    hub = {
      L <- huge::huge.generator(graph = "hub", g = 5, d = p, n = n, verbose = FALSE)
      X <- L$data
      graph <- L$omega
      correlation <- L$sigma
      if (verbose) message("Data, precision matrix and graph generated from Hub structure.")
    },
    scale_free = {
      L <- huge::huge.generator(graph = "scale-free", d = p, n = n, verbose = FALSE) ## d edges graph
      X <- L$data
      graph <- L$omega
      correlation <- L$sigma
      if (verbose) message("Data, precision matrix and graph generated from Scale free structure.")
    },
    block_diagonal = {
      if (inter_cluster_edge_prob != 0) {
        # inter-clusters edges added in the flipped way
        flag <- TRUE
        while (flag) {
          K <- bloc_diag(p, prob_mat, alpha, rho)
          target_indices <- which(K[upper.tri(K)] == 0)
          # Random selection of edges to be set to 1
          select_len <- round(length(target_indices) * inter_cluster_edge_prob)
          selected_indices <- sample(target_indices, select_len)
          precision_level <- unique(K[upper.tri(K)])
          precision_level <- max(precision_level[precision_level != 0])
          K[upper.tri(K)][selected_indices] <- precision_level
          K <- as.matrix(Matrix::forceSymmetric(K, uplo = "U"))
          flag <- any(eigen(K)$values <= 0) # Control of positive definiteness
        }
        correlation <- solve(K)
        graph <- K
        X <- mvtnorm::rmvnorm(n, sigma = correlation)
        if (verbose) message("Data, precision matrix and graph generated from block-diagonal structure with inter-clusters edges.")
      } else {
        # Only intra-cluster edges while approximately controlling correlation level
        K <- bloc_diag(p, prob_mat, alpha, rho)
        correlation <- solve(K)
        graph <- K
        X <- mvtnorm::rmvnorm(n, sigma = correlation)
        if (verbose) message("Data, precision matrix and graph generated from block-diagonal structure with only intra-clusters edges.")
      }
    }
  )
  return(list(X = X, graph = graph, correlation = correlation))
}

adj_mat <- function(mat) {
  diag(mat) <- 0
  mat[abs(mat) < 1e-10] <- 0
  mat[mat != 0] <- 1
  return(mat)
}

set.seed(2020)
sim_sbm <- sim_data(p = 40, structure = "block_diagonal", alpha = rep(1/5, 5),
                    prob_mat = diag(0.75, 5), rho = 0.2,
                    inter_cluster_edge_prob = 0.01)
gsbm <- adj_mat(sim_sbm$graph)
Matrix::image(as(gsbm, "sparseMatrix"), sub = "", xlab = "", ylab = "")
```

Adjacency matrix of a stochastic block model with 5 blocks.
:::

### Erdös-Renyi Model

The Erdös-Renyi model is a special case of the stochastic block model where $\alpha_{kl} = \alpha_{ll} = \alpha$ is constant.
We set the density $\alpha$ of the graph to $0.1$; see @fig-model-erdos for an example of a graph drawn from this model.

::: {#fig-model-erdos}
```{r echo = TRUE}
set.seed(2022)
sim_erdos <- sim_data(p = 40, structure = "erdos", p_erdos = 0.1)
gerdos <- adj_mat(sim_erdos$graph)
Matrix::image(as(gerdos, "sparseMatrix"), sub = "", xlab = "", ylab = "")
```

Adjacency matrix of an Erdös-Renyi model.
:::

### Scale-free Model

The Scale-free model generates networks whose degree distributions follow a power law. The graph starts with an initial chain graph of $2$ nodes. New nodes are then added to the graph one by one, each connected to an existing node with a probability proportional to the degree of that node. We set the number of edges in the graph to $40$. An example of a scale-free graph is shown in @fig-model-sfree.

::: {#fig-model-sfree}
```{r echo = TRUE}
set.seed(2022)
sim_sfree <- sim_data(p = 40, structure = "scale_free")
gsfree <- adj_mat(sim_sfree$graph)
Matrix::image(as(gsfree, "sparseMatrix"), sub = "", xlab = "", ylab = "")
```

Adjacency matrix of a Scale-free model.
:::

## Support recovery

We compare the network structure learning performance of our approach to that of GLasso in its neighborhood selection version using ROC curves. In both GLasso and MGLasso, the sparsity is controlled by a regularization parameter $\lambda_1$; however, MGLasso admits an additional regularization parameter, $\lambda_2$, which controls the strength of convex clustering. To compare the two methods, in each ROC curve we vary the parameter $\lambda_1$ while the parameter $\lambda_2$ (for MGLasso) is kept constant. We computed ROC curves for $4$ different penalty levels of the $\lambda_2$ parameter; since GLasso does not depend on $\lambda_2$, the GLasso ROC curves are replicated.

For a decision rule associated with a sparsity penalty level $\lambda_1$, we recall the definitions of the following two quantities.
The sensitivity, also called the true positive rate or recall, is given by:
\begin{align*}
\lambda_1 &\mapsto \text{sensitivity}(\lambda_1) = \frac{TP(\lambda_1)}{TP(\lambda_1) + FN(\lambda_1)}.
\end{align*}
The specificity, also called true negative rate or selectivity, is defined as follows:
\begin{align*}
\lambda_1 &\mapsto \text{specificity}(\lambda_1) = \frac{TN(\lambda_1)}{TN(\lambda_1) + FP(\lambda_1)}.
\end{align*}
The ROC curve parameterized by $\lambda_1$ represents $\text{sensitivity}(\lambda_1)$ as a function of $1 - \text{specificity}(\lambda_1)$, which is the false positive rate. For each configuration ($n, p$ fixed), we generate $50$ replications and their associated ROC curves, which are then averaged. The average ROC curves for the three models are given in @fig-roc-erdos, @fig-roc-sfree and @fig-roc-sbm, varying $\frac{n}{p}\in \{0.5,1,2\}$.

```{r echo=TRUE}
path_checks <- "./rscripts/mglasso_functions/"
source(paste0(path_checks, "simulate.R"))
source(paste0(path_checks, "plot.R"))
```

```{r eval = FALSE, echo = TRUE, include=TRUE}
# The code used to generate the following results is based on R and python libraries/functions.
# The reticulate package allows running python code from RStudio.
# First set up the python dependencies before running the code.
library(reticulate)
config <- reticulate::py_config() # Initialize python engine
# It is possible to create an isolated virtual or conda environment in which the code
# will be compiled by calling reticulate::virtualenv_create() or reticulate::conda_create(),
# and then activate the environment with reticulate::use_condaenv() or reticulate::use_virtualenv().
system2(config$python, c("-m", "pip", "install",
                         shQuote("git+https://github.com/neurospin/pylearn-parsimony.git")))
# pylearn-parsimony requires scipy version 1.7.1.
# More recent versions generate compilation errors.
reticulate::py_install(packages = c("scipy == 1.7.1",
                                    "scikit-learn",
                                    "numpy == 1.22.4",
                                    "six", # If needed
                                    "matplotlib"))
# Check that all required python dependencies are available.
testthat::expect_true(reticulate::py_module_available("numpy"))
testthat::expect_true(reticulate::py_module_available("scipy"))
testthat::expect_true(reticulate::py_module_available("six"))
testthat::expect_true(reticulate::py_module_available("matplotlib"))
testthat::expect_true("scikit-learn" %in% reticulate::py_list_packages()$package)
testthat::expect_true("pylearn-parsimony" %in% reticulate::py_list_packages()$package)
# It might be necessary to reload RStudio on some operating systems.
source("./rscripts/mglasso_functions/onload.R")
```

```{r eval = FALSE, echo = TRUE, include=TRUE}
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores

## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameters mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))

### Erdos -------------------------------------------------------------------
runtime_roc_config_p40_erdos01 <- system.time(
  roc_config_p40_erdos01 <- mclapply(
    list_ii_rho,
    FUN = one_simu_ROC,
    model = "erdos",
    mc.cores = no_cores
  )
)
save(roc_config_p40_erdos01, file = paste0(path_roc, "roc_config_p40_erdos01.RData"))
save(runtime_roc_config_p40_erdos01, file = paste0(path_roc, "runtime_roc_config_p40_erdos01.RData"))

## Erdos
load(paste0(path_roc, "roc_config_p40_erdos01.RData"))
dt_full <- roc_config_p40_erdos01

### Merge in one graph
# Three sample sizes are used and the vector c(20, 40, 80) is replicated 50 times.
# Subset the dataframe in three parts corresponding to the relevant sample sizes.
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

# Compute the mean over the 50 ROC curves
roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

# Restructure the list result in a matrix for plotting
roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5 # add a ratio n over p variable
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_erdos <- rbind(roc_dt20, roc_dt40, roc_dt80)
```

::: {#fig-roc-erdos}
```{r roc_erdos, message=FALSE, echo = TRUE}
load("./data/roc_dtf_erdos.RData")
ggplot(roc_dtf_erdos, aes(x = fpr, y = tpr, color = method)) +
  geom_line(size = 0.7) +
  facet_grid(np ~ tv) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("") +
  scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])
```

ROC curves for the Erdös-Renyi model comparing MGLasso and GLasso methods.
:::

```{r eval = FALSE, echo = TRUE}
# Performances calculation ------------------------------------------------
# Launched on a cluster using 72 cores

## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameters mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu <- 50
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
no_cores <- min(72, length(list_ii_rho))

### Scale-Free --------------------------------------------------------------
runtime_roc_config_p40_scalefree <- system.time(
  roc_config_p40_scalefree <- mclapply(
    list_ii_rho,
    FUN = one_simu_ROC,
    model = "scale_free",
    mc.cores = no_cores
  )
)
save(roc_config_p40_scalefree, file = paste0(path_roc, "roc_config_p40_scalefree.RData"))
save(runtime_roc_config_p40_scalefree, file = paste0(path_roc, "runtime_roc_config_p40_scalefree.RData"))

## Scale-free
load(paste0(path_roc, "roc_config_p40_scalefree.RData"))
dt_full <- roc_config_p40_scalefree

### Merge in one graph
index <- seq(1, 150, by = 3)
roc_dt20 <- dt_full[index]
index <- seq(2, 150, by = 3)
roc_dt40 <- dt_full[index]
index <- seq(3, 150, by = 3)
roc_dt80 <- dt_full[index]

roc_dt20 <- get_mean_ROC_stat(roc_dt20)
roc_dt40 <- get_mean_ROC_stat(roc_dt40)
roc_dt80 <- get_mean_ROC_stat(roc_dt80)

roc_dt20 <- reformat_roc_res_for_ggplot(roc_dt20)
roc_dt20$np <- 0.5
roc_dt40 <- reformat_roc_res_for_ggplot(roc_dt40)
roc_dt40$np <- 1
roc_dt80 <- reformat_roc_res_for_ggplot(roc_dt80)
roc_dt80$np <- 2
roc_dtf_sfree <- rbind(roc_dt20, roc_dt40, roc_dt80)

### Save
save(roc_dtf_sfree, file = paste0(path_roc, "roc_dtf_sfree.RData"))
```

::: {#fig-roc-sfree}
```{r roc_scale_free, message=FALSE, echo = TRUE}
load("./data/roc_dtf_sfree.RData")
ggplot(roc_dtf_sfree, aes(x = fpr, y = tpr, color = method)) +
  geom_line() +
  facet_grid(np ~ tv) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "grey") +
  xlab("False Positive Rate") +
  ylab("True Positive Rate") +
  ggtitle("") +
  scale_colour_manual(values = ghibli_palette("MarnieMedium1")[5:6])
```

ROC curves for the Scale-free model comparing MGLasso and GLasso methods.
:::

```{r eval = FALSE, echo = TRUE}
# Launched on a cluster using 72 cores

## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
# NA values for some parameters mean they are not relevant
p <- 40
seq_n <- c(20, 40, 80)
seq_rho <- 0.95
seq_dnsty <- NA
type <- NA
alpha <- rep(1/5, 5)
ngroup
<-length(alpha)pi <-diag(0.75, ngroup)### Simulation --------------------------------------------------------------n_simu <-50list_ii_rho <-configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)no_cores <-min(72, length(list_ii_rho))### Stochastic Block Diagonal -----------------------------------------------runtime_roc_config_p40_bdiagflip001 <-system.time( roc_config_p40_bdiagflip001 <-mclapply( list_ii_rho, FUN = one_simu_ROC, model ="block_diagonal",mc.cores = no_cores))save(roc_config_p40_bdiagflip001, file =paste0(path_roc, "roc_config_p40_bdiagflip001.RData"))save(runtime_roc_config_p40_bdiagflip001, file =paste0(path_roc, "runtime_roc_config_p40_bdiagflip001.RData"))load(paste0(path_roc, "roc_config_p40_bdiagflip001.RData")) dt_full <- roc_config_p40_bdiagflip001### Merge in one graphindex <-seq(1, 150, by =3)roc_dt20 <- dt_full[index]index <-seq(2, 150, by =3)roc_dt40 <- dt_full[index]index <-seq(3, 150, by =3)roc_dt80 <- dt_full[index]roc_dt20 <-get_mean_ROC_stat(roc_dt20)roc_dt40 <-get_mean_ROC_stat(roc_dt40)roc_dt80 <-get_mean_ROC_stat(roc_dt80)roc_dt20 <-reformat_roc_res_for_ggplot(roc_dt20)roc_dt20$np <-0.5roc_dt40 <-reformat_roc_res_for_ggplot(roc_dt40)roc_dt40$np <-1roc_dt80 <-reformat_roc_res_for_ggplot(roc_dt80)roc_dt80$np <-2roc_dtf_sbm <-rbind(roc_dt20, roc_dt40, roc_dt80)save(roc_dtf_sbm,file =paste0(path_roc, "roc_dtf_sbm.RData"))```::: {#fig-roc-sbm}```{r roc_sbm , message=FALSE, echo = TRUE}load("./data/roc_dtf_sbm.RData")ggplot(roc_dtf_sbm, aes(x = fpr, y = tpr, color = method )) +geom_line() +facet_grid(np ~ tv) +geom_abline(intercept =0, slope =1, linetype ="dashed", color ="grey") +xlab("False Positive Rate") +ylab("True Positive Rate") +ggtitle("") +scale_colour_manual(values =ghibli_palette("MarnieMedium1")[5:6])```ROC curves for the Stochastic Block model comparing MGLasso and GLassomethods.:::Based on these empirical results, we first observe that, in all theconsidered simulation models, MGLasso improves over GLasso in terms ofsupport 
recovery in the high-dimensional setting where $n < p$. In addition, in
the absence of a total variation penalty, i.e., $\lambda_2 = 0$, MGLasso
performs no worse than GLasso in each of the $3$ models. However, for
$\lambda_2 > 0$, increasing the penalty value does not seem to
significantly improve the support recovery performance of MGLasso, as we
observe similar results for $\lambda_2 = 3.3, 6.6, 10$. Preliminary
analyses show that, as $\lambda_2$ increases, the estimates of the
regression vectors are shrunk towards $0$. This shrinkage effect of
group-fused penalty terms was also observed in [@chu2021adaptive].

## Clustering

To assess clustering performance, we compared the partitions estimated
by MGLasso, Hierarchical Agglomerative Clustering (HAC) with Ward's
distance, and K-means to the true partition in a stochastic block model
framework. Euclidean distances between variables are used for HAC and
K-means. The criterion used for the comparison is the adjusted Rand
index. We studied the influence of the correlation level inside clusters
on the clustering performance through two values of the correlation
parameter, $\rho \in \{ 0.25, 0.95 \}$; the vector of cluster
proportions is fixed at $\mathbf \pi = (1/5, \dots, 1/5)$.
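As a side illustration of the comparison criterion (not part of the simulation pipeline), the adjusted Rand index between a true and an estimated partition can be computed with scikit-learn, which is already among the Python dependencies installed above; the toy partitions below are hypothetical:

```python
# Minimal sketch of the adjusted Rand index (ARI) criterion.
from sklearn.metrics import adjusted_rand_score

# p = 40 variables in 5 balanced "true" clusters
true_part = [k for k in range(5) for _ in range(8)]

# The ARI is invariant to label switching: a relabeled copy scores 1
relabeled = [(k + 1) % 5 for k in true_part]
print(adjusted_rand_score(true_part, relabeled))   # 1.0

# A trivial partition lumping all variables together is uninformative
print(adjusted_rand_score(true_part, [0] * 40))    # 0.0
```

The index is chance-adjusted, so a random partition scores around $0$ on average rather than a positive baseline, which makes comparisons across different numbers of clusters meaningful.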
We then simulate $100$ Gaussian datasets for each simulation
configuration ($\rho$, $n/p$ fixed). The optimal sparsity penalty for
MGLasso is chosen by the Stability Approach to Regularization Selection
(StARS) method [@Liu2010], and we vary the parameter $\lambda_2$.

```{r eval=FALSE, echo=TRUE, include=TRUE}
# Launch simulations ------------------------------------------------------
## Settings ----------------------------------------------------------------
### Model -------------------------------------------------------------------
p <- 40
seq_n <- c(20, 40, 80)
alpha <- rep(1/5, 5)
seq_rho <- c(0.25, 0.95)
seq_dnsty <- c(0.75)
type <- NA # unused; to do: delete in configs_simu parameters
ngroup <- length(alpha)
pi <- diag(0.75, ngroup)

### Simulation --------------------------------------------------------------
n_simu <- 100
list_ii_rho <- configs_simu(n_simu, seq_rho, seq_dnsty, seq_n, type)
mc_cores <- min(80, length(list_ii_rho))
RNGkind("L'Ecuyer-CMRG")

## Test --------------------------------------------------------------------
# For a quicker test:
# set nl2 to 2 in one_simu_extended
# set p = 9 & n = 10
test <- one_simu_extended(list_ii_rho$`1`, verbose = TRUE, model = "block_diagonal")

## Models ------------------------------------------------------------------
# After the quicker test: reset nl2 to 20
### Stochastic Block Diagonal -----------------------------------------------
runtime_rand50_config_p40_bdiagflip001_allcor <- system.time(
  rand50_config_p40_bdiagflip001_allcor <- mclapply(
    list_ii_rho,
    FUN = one_simu_extended,
    model = "block_diagonal",
    mc.cores = mc_cores
  )
)
save(runtime_rand50_config_p40_bdiagflip001_allcor,
     file = paste0(path_extended, "runtime_rand100_config_p40_bdiagflip001_allcor.RData"))
save(rand50_config_p40_bdiagflip001_allcor,
     file = paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
```

```{r eval=FALSE, echo=TRUE, include=TRUE}
# Clustering
# Settings:
# - Inter-clusters edge probability $0.01$ (flip on all the missing edges)

## Rand index
load(paste0(path_extended, "rand100_config_p40_bdiagflip001_allcor.RData"))
dt <- rand100_config_p40_bdiagflip001_allcor

# Calculate cluster partitions, with thresh_fuse as the required difference
# threshold for merging two regression vectors
list_res <- lapply(dt, function(e) {
  get_perf_from_raw("rand", e, thresh_fuse = 1e-6)
})
dt_rand <- do.call(rbind, list_res)
save(dt_rand,
     file = paste0(path_extended, "dt_rand_p40_bdiagflip001_allcor_thresh_e6.RData"))

## Theoretical correlation set to $0.25$
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.25, np_ = c(0.5, 1, 2))

## Theoretical correlation set to $0.95$
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.95, np_ = c(0.5, 1, 2))

# The files `rand_dt_higher_cor_sbm.RData` and `rand_dt_lower_cor_sbm.RData`
# are obtained from splitting `dt_rand` according to theoretical correlation levels.
```

::: {#fig-ari-low-cor}
```{r echo = TRUE}
load("./data/rand_dt_lower_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.25, np_ = c(0.5, 1, 2), main = "")
```

Adjusted Rand indices for the HAC, K-means and MGLasso methods.
Performances are observed for 4 different numbers of clusters in a low
correlation context.
:::

::: {#fig-ari-high-cor}
```{r echo = TRUE}
load("./data/rand_dt_higher_cor_sbm.RData")
plot_res(dt_rand, crit_ = "rand", ncluster_ = c(5, 10, 15, 20),
         cor_ = 0.95, np_ = c(0.5, 1, 2), main = "")
```

Adjusted Rand indices for the HAC, K-means and MGLasso methods.
Performances are observed for 4 different numbers of clusters in a high
correlation context.
:::

The results shown in @fig-ari-low-cor and @fig-ari-high-cor demonstrate
that, particularly at the lower to medium levels of the hierarchy
(between 20 and 10 clusters), the hierarchical clustering structure
uncovered by MGLasso is comparable to popular clustering methods used in
practice. For the higher levels (5 clusters), the performance of
MGLasso deteriorates.
As expected, the three compared methods also deteriorate as the level of
correlation inside clusters decreases.

# Analysis of microbial associations data

We finally illustrate our new method for inferring the multiscale
Gaussian graphical model with an application to the analysis of
microbial associations in the American Gut Project. The data used are
count data that have been previously normalized by applying the
centered log-ratio transformation, as used in [@Kurtz2015]. After some
filtering steps [@Kurtz2015] on the operational taxonomic unit (OTU)
counts (removed if present in less than $37\%$ of the samples) and on
the samples (removed if the sequencing depth is below 2700), the top
OTUs are grouped in a dataset composed of $n = 289$ samples for $127$
OTUs. As a preliminary analysis, we perform a hierarchical agglomerative
clustering (HAC) on the OTUs, which allows us to identify four
significant groups. The correlation matrix of the dataset is given in
@fig-emp-cor; variables have been rearranged according to the HAC
partition.

::: {#fig-emp-cor}
```{r echo = TRUE}
data(amgut1.filt, package = "SpiecEasi")
dta <- amgut1.filt
dta <- t(SpiecEasi::clr(dta + 1, 1))
S <- cor(dta)
# Euclidean-distance alternative, superseded by the correlation-based
# dissimilarity below:
# hclust_dta <- hclust(dist(t(dta)), method = "ward.D")
hclust_dta <- hclust(as.dist(1 - S^2), method = "ward.D")
cut_dta <- stats::cutree(hclust_dta, 4)
Matrix::image(as(S[order(cut_dta), order(cut_dta)], "sparseMatrix"),
              sub = "", xlab = "", ylab = "", colorkey = FALSE)
```

Empirical correlations in the gut data.
:::

The average correlations within blocks of variables belonging to the
same cluster are given below.
We observe relatively high levels of correlation in small blocks,
similar to the simulation models used to evaluate the performance of
clustering in the [Simulation Experiments] section.

```{r echo = TRUE}
C <- cor(dta)
diag(C) <- 0
clusters <- cut_dta
seq_p <- 1:length(clusters)
L <- split(seq_p, factor(clusters))
mat <- t(sapply(L, FUN = function(v) {
  summary(as.vector(C[v, v]))
}))
out <- data.frame(Cluster = 1:4,
                  "Mean correlation" = round(mat[, "Mean"], 3))
knitr::kable(out)
```

<!-- : my table caption {#tbl-mylabel} -->

We apply MGLasso to the normalized counts to infer a graph and a
clustering structure. The graphs obtained by MGLasso for $\lambda_2 = 0$
and for $\lambda_2 = 5$ (corresponding respectively to $127$ and $80$
clusters) are given below. In each case, the parameter $\lambda_1$ is
chosen by stability selection (see the [Model Selection] section).

::: {#fig-real-data-mglasso}
```{r, out.width="49%", out.height="20%", fig.show='hold', fig.align='center', echo = TRUE}
knitr::include_graphics(c("figures/mglasso_tv0.png",
                          "figures/mglasso_full_graph_tv5.png"))
```

Inferred graphs using MGLasso for different values of the total
variation penalty.
:::

The variables were reordered according to the clustering partition
obtained from the distances between the regression vectors. Increasing
$\lambda_2$ reduces the number of clusters and leads to a shrinking
effect on the estimates. The adjacency matrix of the neighborhood
selection, equivalent to setting $\lambda_2$ to $0$, is given in
@fig-real-data-mglasso (left). In @fig-real-data-mglasso (right), the
deduced partition is composed of $80$ clusters. A confusion matrix
comparing the edges deduced by MGLasso with $\lambda_2 = 5$ and
neighborhood selection is given below.
Adding a total variation parameter increases the merging effect,
resulting in a larger number of edges in the graph.

|                           |           | Neighborhood selection |       |
|---------------------------|-----------|:----------------------:|-------|
| MGLasso ($\lambda_2 = 5$) |           | non-edges              | edges |
|                           | non-edges | 15678                  | 0     |
|                           | edges     | 288                    | 163   |

<!-- : my table caption {#tbl-mylabel} -->

<!-- @tbl-mylabel -->

# Conclusion

We proposed a new technique, called MGLasso, that combines Gaussian
Graphical Model inference and hierarchical clustering. The method
proceeds via convex optimization and minimizes the neighborhood
selection objective penalized by a hybrid regularization combining a
sparsity-inducing norm and a convex clustering penalty. We developed a
complete numerical scheme to apply MGLasso in practice, with an
optimization algorithm based on CONESTA and a model selection procedure.
Our simulation results on synthetic and real datasets showed that
MGLasso can perform better than GLasso in network support recovery in
the presence of groups of correlated variables, and we illustrated the
method with the analysis of microbial associations data. The present
work paves the way for future improvements: first, by incorporating
prior knowledge through more flexible weighted regularization; second,
by studying the theoretical properties of the method in terms of
statistical guarantees for the MGLasso estimator. Moreover, the
node-wise regression approach on which our method is based can be
extended to a broader family of non-Gaussian distributions belonging to
the exponential family, as outlined by @yang2012graphical.

# Session information {.appendix .unnumbered}

```{r session-info, echo = TRUE}
sessionInfo()
```